CN113869412B - Image target detection method combining lightweight attention mechanism and YOLOv network - Google Patents
- Publication number
- CN113869412B CN113869412B CN202111141568.1A CN202111141568A CN113869412B CN 113869412 B CN113869412 B CN 113869412B CN 202111141568 A CN202111141568 A CN 202111141568A CN 113869412 B CN113869412 B CN 113869412B
- Authority
- CN
- China
- Legal status: Active
Classifications
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Abstract
The invention discloses an image target detection method combining a lightweight attention mechanism with the YOLOv3 network, covering the training process of the combined target detection algorithm. The lightweight attention mechanism is integrated into the YOLOv3 network to strengthen feature extraction; a depth separable convolution module is incorporated to raise the efficiency of the algorithm and further improve detection precision; and the multi-scale fusion method of the traditional YOLOv3 network is retained to improve the feature extraction capability and overall performance of the model. By combining the lightweight attention mechanism, depth separable convolution and multi-scale fusion within the YOLOv3 network, a target detection method with higher recognition accuracy is designed that effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
Description
Technical Field
The invention relates to the technical field of target detection algorithm research in computer vision, and in particular to an image target detection method combining a lightweight attention mechanism with the YOLOv3 network.
Background
Target detection is an important branch of computer vision and one of its most fundamental problems. It consists of identifying the spatial location or extent of objects of specified categories (such as people, dogs, zebras, elephants and cars) in given images. Target detection also plays a central role in artificial intelligence and information technology, especially in robot vision, face recognition, automatic driving and intelligent monitoring. The main challenges today are accuracy and efficiency, and how to improve efficiency while preserving accuracy is a major focus of current research. Existing target detection algorithms fall into two main classes: single-stage detection frameworks (one-stage detectors) and two-stage detection frameworks (two-stage detectors). A single-stage method computes directly on the complete image to produce detections, whereas a two-stage method first preprocesses the image to extract candidate boxes and then refines them to obtain the final detection result. Two-stage detection is generally more accurate but slower. Commonly used two-stage methods include the region-based convolutional neural network (R-CNN), the Fast R-CNN algorithm, the spatial pyramid pooling network (SPP-Net) and the multi-region convolutional neural network (MR-CNN). Among these, R-CNN is the cornerstone of two-stage target detection and its most representative algorithm.
In its preprocessing stage, a selective search algorithm selects candidate boxes of interest, after which the spatial position of the target is located through a convolutional neural network, a support vector machine and a regression method. Commonly used single-stage detectors include the Single Shot MultiBox Detector (SSD), YOLO, YOLOv2 and YOLOv3. Current research on target detection algorithms can be roughly divided into three directions: improving two-stage algorithms, improving single-stage methods, and combining the two. Although these algorithms perform well in target detection studies, there is room for further improvement, which motivates the target detection method proposed here, combining a lightweight attention mechanism with the YOLOv3 network.
The R-CNN algorithm is a two-stage model and the earliest deep learning method proposed for target detection. It first uses a selective search algorithm to measure feature similarity between adjacent image blocks in the original image, scores similar blocks, and selects candidate boxes over image regions of interest. These are fed as samples into a trained CNN; after feature extraction the features pass through a fully connected layer, and an SVM classifier together with a linear regression model is trained to complete the final target detection task.
Although the R-CNN algorithm improves considerably on conventional target detection methods, and the trained CNN performs well at image feature extraction, its running time is high: candidate-region generation in the first stage uses a conventional method, and when an image yields a large number of candidate regions, the forward computation of the CNN multiplies accordingly, because every candidate region undergoes feature extraction separately. These repeated operations limit the performance of the R-CNN algorithm.
The Fast-R-CNN algorithm improves on the R-CNN algorithm, with the main goal of reducing its running time. Like R-CNN, it relies on generating proposal regions, but the difference is that candidate boxes are no longer passed through the neural network one by one; instead the whole image is passed through the convolutional neural network once to extract features, and the candidate regions are then mapped onto the resulting feature map and fused in a pooling layer. In summary, the most important contributions of Fast-R-CNN are the region-of-interest pooling layer and parallel multi-task training.
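The ROI pooling idea described above can be sketched as follows. This is a simplified, hedged illustration (integer bin boundaries, max pooling over a single-channel 2-D map, hypothetical function names), not the exact implementation used by Fast-R-CNN or by the patent:

```python
# Sketch of ROI max pooling: a region of the shared feature map is divided
# into a fixed out_h x out_w grid of bins, and each bin is max-pooled, so
# regions of any size yield a fixed-size feature. Simplified for illustration.

def roi_max_pool(feature_map, roi, out_h, out_w):
    """feature_map: 2-D list of numbers; roi: (x1, y1, x2, y2) in map coords."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Integer bin boundaries; each bin covers at least one cell.
            ys = y1 + i * h // out_h
            ye = max(ys + 1, y1 + (i + 1) * h // out_h)
            xs = x1 + j * w // out_w
            xe = max(xs + 1, x1 + (j + 1) * w // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

For example, pooling the full extent of a 4 × 4 map into a 2 × 2 grid keeps the maximum of each quadrant, regardless of the region's original size.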
The Fast-R-CNN algorithm has several disadvantages:
1) Like R-CNN, Fast-R-CNN must first select regions of interest and then perform feature extraction; this selection step can only run on a CPU, wasting a great amount of time.
2) Because of this running-time limitation, the Fast-R-CNN algorithm cannot be used in real-time applications, and it does not truly achieve end-to-end training and testing.
The SSD algorithm is a one-stage target detection algorithm whose feature extractor is the VGG-16 network. Given an input image, SSD first applies several convolution layers to obtain feature maps of different sizes, estimates local feature information in these maps with convolution kernels, and simultaneously computes the spatial position and classification probability of the targets to be detected. Because SSD operates over many positions in the image and the sizes of the resulting bounding boxes are inconsistent, redundant boxes appear. To solve this, SSD applies non-maximum suppression to merge bounding boxes with high overlap, and introduces hard negative mining to keep positive and negative samples balanced.
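The non-maximum suppression step mentioned above can be illustrated with a minimal sketch. The box format `(x1, y1, x2, y2, score)`, the threshold value and the function names are illustrative assumptions, not details taken from SSD or the patent:

```python
# Minimal non-maximum suppression: repeatedly keep the highest-scoring box
# and drop every remaining box that overlaps it too strongly.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, iou_threshold=0.5):
    """boxes: list of (x1, y1, x2, y2, score); returns the kept boxes."""
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining
                     if iou(best[:4], b[:4]) < iou_threshold]
    return kept
```

Two nearly coincident detections of the same object are thus merged into the single higher-scoring box, while well-separated detections survive.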
The SSD algorithm has the following disadvantages:
1) The features SSD extracts are relatively shallow, so it often handles samples of lower resolution poorly.
2) Certain parameters in the SSD algorithm must be set by hand and cannot be learned through training, so the debugging process depends heavily on experience and carries some randomness, giving the method poor generalization ability.
The Faster-R-CNN algorithm is obtained by optimizing on the basis of the Fast-R-CNN model, and is therefore also a two-stage target detection method. It combines a region proposal generation module with the Fast-R-CNN module to complete the target detection task. The Fast-R-CNN module computes the feature mapping of the input image and extracts features on that basis. The region proposal module adopts a sliding-window strategy to generate candidate regions on the convolved feature map, and the candidate regions are finally passed through an ROI pooling layer to a fully connected layer for the final fusion operation. The Faster-R-CNN algorithm thereby achieves end-to-end training and improves the detection efficiency of the model.
The Faster-R-CNN algorithm has the following disadvantages:
1) Because the Faster-R-CNN algorithm divides the training process into two phases, it cannot meet real-time efficiency requirements.
2) The Faster-R-CNN algorithm performs poorly on small targets, chiefly because its final prediction uses a single deep feature map, which yields poor generalization ability across scales.
Disclosure of Invention
The invention aims to provide an image target detection method combining a lightweight attention mechanism with the YOLOv3 network, which effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
In order to achieve the above purpose, the present invention provides the following technical solution: an image target detection method combining a lightweight attention mechanism and the YOLOv3 network, characterized by the following training process of the target detection algorithm.
The training process is divided into two phases. The first phase performs feature extraction on the input image at multiple scales, using residual structures built from depth separable convolution and an attention mechanism; the second phase fuses the multi-scale features trained in the first phase and outputs the final prediction. The specific training process is as follows:
Step 1: the network initializes its weights;
Step 2: the input image undergoes multi-scale feature extraction;
Step 3: at each scale, a downsampled feature map is obtained through the depth separable convolution layers and the residual module with the attention mechanism;
Step 4: the features at each scale are passed through a convolution layer to output a prediction;
Step 5: the per-scale output predictions are fused to form the final prediction model.
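The five steps above can be sketched as a skeleton. Every function body below is a stand-in (the patent specifies only the structure: weight initialization, multi-scale extraction, per-scale prediction heads, fusion), so the scale values and numeric outputs are meaningless placeholders:

```python
# Skeleton of the five-step training/inference flow; all computations are
# placeholders standing in for the real backbone and prediction heads.

import random

def init_weights(seed=0):                    # step 1: initialise weights
    random.seed(seed)

def extract_multiscale(image, scales=(52, 26, 13)):   # steps 2-3
    """Stand-in for the separable-conv + attention backbone: one 'feature
    map' (here just a scalar summary) per downsampling scale."""
    return {s: sum(image) / (s ** 2) for s in scales}

def predict_per_scale(features):             # step 4: per-scale heads
    return {s: f * random.random() for s, f in features.items()}

def fuse_predictions(preds):                 # step 5: fuse the scales
    return sum(preds.values()) / len(preds)

init_weights()
features = extract_multiscale(image=[1.0] * 416 * 416)
prediction = fuse_predictions(predict_per_scale(features))
```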
Preferably, the method includes a depth separable convolution structure. This structure implements part of the feature extraction function and is the key module of the lightweight design: in standard convolution, the spatial convolution and the combination of feature channels are performed simultaneously, whereas depth separable convolution separates these two parts into a depthwise convolution process and a pointwise convolution process, and this grouped convolution greatly reduces the computation and parameter count, achieving the lightweight goal. For an input feature map of size D_F × D_F × M, the operation decomposes into a depthwise convolution with M kernels of size D_K × D_K × 1 and a pointwise convolution with N kernels of size 1 × 1 × M, so the computational cost is
O1 = D_K · D_K · M · D_F · D_F + M · N · D_F · D_F
For the traditional standard convolution process, the cost under the same input is
O2 = D_K · D_K · M · N · D_F · D_F
Comparing the two gives the ratio
O1 / O2 = 1/N + 1/(D_K · D_K)
so when the convolution kernel size is 3 × 3, depth separable convolution reduces the computation by nearly 9 times, effectively improving the efficiency of the model.
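The cost comparison can be checked numerically. Symbols follow the text (D_K kernel size, D_F feature-map size, M input channels, N output channels); the concrete sizes chosen below are arbitrary examples, not values from the patent:

```python
# Numeric check of the two cost formulas for one convolution layer.

def separable_cost(dk, df, m, n):
    # depthwise pass (dk*dk*m per position) + 1x1 pointwise pass (m*n)
    return dk * dk * m * df * df + m * n * df * df

def standard_cost(dk, df, m, n):
    return dk * dk * m * n * df * df

dk, df, m, n = 3, 56, 64, 128          # example layer sizes
ratio = separable_cost(dk, df, m, n) / standard_cost(dk, df, m, n)
# ratio = 1/N + 1/DK^2, i.e. close to a 9x reduction for a 3x3 kernel
```

With a 3 × 3 kernel the ratio is dominated by the 1/9 term, matching the "nearly 9 times" claim in the text.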
Preferably, the residual structure with the attention mechanism is the other part of the feature extraction process and improves the feature extraction performance of the backbone network. An input feature image U first undergoes a pointwise convolution, then a 3 × 3 depthwise convolution, giving the extracted feature map F; the SE-Block attention module is then applied to obtain a new map F1, and finally F and F1 are summed to give the output feature map V. The attention mechanism optimizes the coupling of the channel and spatial domains and guides the feature extraction network to learn the regions of interest.
Here F_tr(·, θ) denotes the convolution mapping operation, specifically:
F_tr: X → U, X ∈ R^(H′×W′×C′), U ∈ R^(H×W×C)
F_sq(·) denotes the Squeeze operation, i.e. the compression operation, which averages each channel over its spatial extent:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
F_ex(·, W) denotes the Excitation operation, specifically:
F_ex(z, W) = σ(g(z, W)) = σ(W2 · ReLU(W1 · z))
where z is the output of the compression operation, the activation function σ is the Sigmoid, and r is a hyperparameter (the channel reduction ratio of W1 and W2). The final output is written as X̃ = F_scale(u, s) = s · u, where u and s are the output of the convolution operation and the output of the excitation operation, respectively.
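The Squeeze-Excitation-scale pipeline above can be sketched from scratch on a toy feature map. The weight matrices W1 and W2 here are fixed values supplied by the caller (in the real module they are learned), and the list-based tensor layout is purely illustrative:

```python
# From-scratch sketch of SE-Block: F_sq (global average pool per channel),
# F_ex (bottleneck FC -> ReLU -> FC -> Sigmoid), F_scale (channel rescaling).

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(u, w1, w2):
    """u: list of C channels, each an H x W list of lists.
    w1: (C/r) x C weights, w2: C x (C/r) weights, r the reduction ratio."""
    h, w = len(u[0]), len(u[0][0])
    # F_sq: squeeze each channel to one number by global average pooling
    z = [sum(sum(row) for row in ch) / (h * w) for ch in u]
    # F_ex: bottleneck FC -> ReLU, then FC -> Sigmoid
    hid = [max(0.0, sum(wr[c] * z[c] for c in range(len(z)))) for wr in w1]
    s = [sigmoid(sum(wr[j] * hid[j] for j in range(len(hid)))) for wr in w2]
    # F_scale: rescale every channel by its attention weight s_c
    return [[[s[c] * v for v in row] for row in u[c]] for c in range(len(u))]
```

With C = 2 and r = 2, a channel with larger average activation is rescaled by the same learned gate as its neighbours but keeps its relative magnitude, which is the recalibration effect described in the text.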
Preferably, a common cross entropy function is adopted as the loss function of the prediction model; the difference between the predicted value and the true value is measured by the cross entropy, expressed as
L = −[ y · log y′ + (1 − y) · log(1 − y′) ]
where y is the true label and y′ is the predicted probability that the sample belongs to the class. To further balance the weight distribution of hard samples in actual detection, the overall loss function of the improved network re-weights this cross entropy term so that hard, easily misclassified samples contribute more to the loss.
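The cross entropy above can be written as a small numeric sketch (binary case; the clamping epsilon is an implementation detail added here for numerical safety, not stated in the text):

```python
# Binary cross entropy between true label y (0 or 1) and predicted
# probability p, clamped away from 0 and 1 to avoid log(0).

import math

def cross_entropy(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

A confident correct prediction (y = 1, p = 0.9) incurs a much smaller loss than an uncertain one (y = 1, p = 0.5), which is the gradient signal the training process relies on.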
Preferably, the SE-Block module recalibrates the feature relationships in the network through its compression and excitation processes, increasing the weights of effective channels and decreasing those of ineffective or weakly effective channels.
Preferably, the depth separable convolution corresponds to the conv2d operation of the operator stage.
Preferably, the residual module with the attention mechanism corresponds to the bneck operation of the operator stage.
Compared with the prior art, the invention has the following beneficial effects:
The method combines a lightweight attention mechanism with the YOLOv3 network to improve feature extraction capability; a depth separable convolution module is integrated into the network to raise algorithmic efficiency and further improve detection accuracy; and the multi-scale fusion method of the traditional YOLOv3 network is used to strengthen the model's feature extraction and overall performance. By combining the lightweight attention mechanism, depth separable convolution and multi-scale fusion within the YOLOv3 network, a target detection method with higher recognition accuracy is designed that effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
Drawings
FIG. 1 is a training process for the lightweight attention mechanism and object detection of YOLOv networks of the present invention;
FIG. 2 is a sample of a face object detection image of the present invention;
FIG. 3 is a diagram of a depth separable convolution structure of the present invention;
FIG. 4 is a residual structure of the attention mechanism of the present invention;
FIG. 5 is a schematic view of the SE-Block structure;
FIG. 6 shows the variation curves of different models of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to figs. 1-6, an image target detection method combining a lightweight attention mechanism and the YOLOv3 network includes the following training process of the target detection algorithm.
The training process is divided into two phases. The first phase performs feature extraction on the input image at multiple scales, using residual structures built from depth separable convolution and an attention mechanism; the second phase fuses the multi-scale features trained in the first phase and outputs the final prediction. The specific training process is as follows:
Step 1: the network initializes its weights;
Step 2: the input image undergoes multi-scale feature extraction;
Step 3: at each scale, a downsampled feature map is obtained through the depth separable convolution layers and the residual module with the attention mechanism;
Step 4: the features at each scale are passed through a convolution layer to output a prediction;
Step 5: the per-scale output predictions are fused to form the final prediction model.
In this embodiment, the method includes a depth separable convolution structure. This structure implements part of the feature extraction function and is the key module of the lightweight design: in standard convolution, the spatial convolution and the combination of feature channels are performed simultaneously, whereas depth separable convolution separates these two parts into a depthwise convolution process and a pointwise convolution process, and this grouped convolution greatly reduces the computation and parameter count, achieving the lightweight goal. For an input feature map of size D_F × D_F × M, the operation decomposes into a depthwise convolution with M kernels of size D_K × D_K × 1 and a pointwise convolution with N kernels of size 1 × 1 × M, so the computational cost is
O1 = D_K · D_K · M · D_F · D_F + M · N · D_F · D_F
For the traditional standard convolution process, the cost under the same input is
O2 = D_K · D_K · M · N · D_F · D_F
Comparing the two gives the ratio
O1 / O2 = 1/N + 1/(D_K · D_K)
so when the convolution kernel size is 3 × 3, depth separable convolution reduces the computation by nearly 9 times, effectively improving the efficiency of the model.
In this embodiment, the residual structure with the attention mechanism is the other part of the feature extraction process and improves the feature extraction performance of the backbone network. An input feature image U first undergoes a pointwise convolution, then a 3 × 3 depthwise convolution, giving the extracted feature map F; the SE-Block attention module is then applied to obtain a new map F1, and finally F and F1 are summed to give the output feature map V. The attention mechanism optimizes the coupling of the channel and spatial domains and guides the feature extraction network to learn the regions of interest.
Here F_tr(·, θ) denotes the convolution mapping operation, specifically:
F_tr: X → U, X ∈ R^(H′×W′×C′), U ∈ R^(H×W×C)
F_sq(·) denotes the Squeeze operation, i.e. the compression operation, which averages each channel over its spatial extent:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)
F_ex(·, W) denotes the Excitation operation, specifically:
F_ex(z, W) = σ(g(z, W)) = σ(W2 · ReLU(W1 · z))
where z is the output of the compression operation, the activation function σ is the Sigmoid, and r is a hyperparameter (the channel reduction ratio of W1 and W2). The final output is written as X̃ = F_scale(u, s) = s · u, where u and s are the output of the convolution operation and the output of the excitation operation, respectively.
In this embodiment, a common cross entropy function is adopted as the loss function of the prediction model; the difference between the predicted value and the true value is measured by the cross entropy, expressed as
L = −[ y · log y′ + (1 − y) · log(1 − y′) ]
where y is the true label and y′ is the predicted probability that the sample belongs to the class. To further balance the weight distribution of hard samples in actual detection, the overall loss function of the improved network re-weights this cross entropy term so that hard, easily misclassified samples contribute more to the loss.
In this embodiment, the SE-Block module recalibrates the feature relationships in the network through its compression and excitation processes, increasing the weights of effective channels and decreasing those of ineffective or weakly effective channels.
In this embodiment, the depth separable convolution corresponds to the conv2d operation of the operator stage.
In this embodiment, the residual module with the attention mechanism corresponds to the bneck operation of the operator stage.
The lightweight attention mechanism is combined with the YOLOv3 network to improve feature extraction capability; the depth separable convolution module is integrated into the network to raise algorithmic efficiency and further improve detection precision; and the multi-scale fusion method of the traditional YOLOv3 network is used to strengthen the model's feature extraction and overall performance. By combining the lightweight attention mechanism, depth separable convolution and multi-scale fusion within the YOLOv3 network, a target detection method with higher recognition accuracy is designed that effectively completes the task of target detection in images, automatically extracts image features, and achieves higher detection precision while improving efficiency.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (5)
1. An image target detection method combining a lightweight attention mechanism and the YOLOv3 network, characterized by comprising the following training process of the target detection algorithm:
The training process is divided into two phases. The first phase performs feature extraction on the input image at multiple scales, using a depth separable convolution structure and a residual structure with an attention mechanism; the second phase fuses the multi-scale features trained in the first phase and outputs the final prediction. The specific training process is as follows:
Step 1: the network initializes its weights;
Step 2: the input image undergoes multi-scale feature extraction;
Step 3: at each scale, a downsampled feature map is obtained through the depth separable convolution layers and the residual module with the attention mechanism;
Step 4: the features at each scale are passed through a convolution layer to output a prediction;
Step 5: the per-scale output predictions are fused to form the final prediction model,
Wherein the depth separable convolution structure implements part of the feature extraction function and is the key module of the lightweight design: in standard convolution, the spatial convolution and the combination of feature channels are performed simultaneously, whereas depth separable convolution separates these two parts into a depthwise convolution process and a pointwise convolution process, and this grouped convolution greatly reduces the computation and parameter count, achieving the lightweight goal; for an input feature map of size D_F × D_F × M, the operation decomposes into a depthwise convolution of size D_K × D_K × 1 × M and a pointwise convolution of size 1 × 1 × M × N, so the computational cost is
O1 = D_K · D_K · M · D_F · D_F + M · N · D_F · D_F,
Wherein the residual structure of the attention mechanism is another part of the feature extraction process, and is used for improving the feature extraction performance on the backbone network, as an input feature image U, firstly performing point convolution operation, then performing depth convolution operation with the size of 3×3 to obtain a graph F after feature extraction, then collecting the attention mechanism SE-Block module to obtain a new graph F1, and finally summing the graphs F and F1 to obtain a final output feature graph V, specifically, the attention mechanism can optimize the connection of a channel domain and a space domain and can induce the feature extraction network to learn the region of interest,
Wherein: f tr (, θ) represents a convolution mapping operation, specifically:
Ftr:X→U,X∈RH′×W′×C′,U∈RH×W×C
Wherein: f sq (·) represents a squeze operation, i.e. a compression operation, in particular
Wherein: f ex (. Cndot.w) represents the expression operation, i.e. the Excitation operation, in particular
Fex(z,W)=σ(g(z,W))=σ(W2ReLU(W1z))
Wherein: z is the output of the compression operation, the activation function σ is the Sigmoid, and r is the reduction-ratio hyperparameter. The final output is written as x̃_c = F_scale(u_c, s_c) = s_c · u_c, where u_c and s_c are the output of the convolution operation and the output of the excitation operation, respectively.
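The Squeeze, Excitation, and scale steps above, together with the residual sum V = F + F1 described in the claim, can be sketched in NumPy. This is a minimal sketch under stated assumptions: the weight matrices `w1` and `w2` are illustrative stand-ins for the two fully connected layers (with reduction ratio r implicit in their shapes), not trained parameters.

```python
import numpy as np

def se_block(u, w1, w2):
    """Squeeze-and-Excitation over a feature map u of shape (H, W, C).

    w1: (C, C//r) reduction weights, w2: (C//r, C) expansion weights;
    both illustrative, with r the reduction-ratio hyperparameter.
    """
    # Squeeze: global average pooling over the spatial dimensions -> z, shape (C,).
    z = u.mean(axis=(0, 1))
    # Excitation: two FC layers, ReLU then Sigmoid -> channel weights s in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0.0) @ w2)))
    # Scale: reweight each channel of u by its attention score (broadcast over H, W).
    return u * s

def se_residual(f, w1, w2):
    # Residual form used in the claim: V = F + F1, where F1 = SE(F).
    return f + se_block(f, w1, w2)
```

Because the Sigmoid output lies in (0, 1), the scaled map can only attenuate channels, which is how the block raises the relative weight of informative channels.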
2. The method for detecting an image object combining a lightweight attention mechanism and a YOLOv network according to claim 1, wherein a common cross-entropy function is adopted as the loss function of the prediction model, and the difference between the predicted value and the true value is calculated with the cross entropy, expressed as follows:
L = −[y log y′ + (1 − y) log(1 − y′)],
wherein y represents the true label and y′ represents the probability that the sample belongs to a certain class; to further balance the weight distribution of difficult samples in actual detection, the overall loss function expression of the improved network is as follows:
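The cross-entropy term of claim 2 can be written directly from the definitions of y and y′; this sketch adds only a small clamp (an assumption, not in the claim) to keep log(0) finite at the endpoints.

```python
import math

def binary_cross_entropy(y, y_pred, eps=1e-12):
    # L = -[y*log(y') + (1-y)*log(1-y')], with y' clamped away from 0 and 1
    # so the logarithms stay finite (the clamp is an implementation detail).
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y * math.log(y_pred) + (1.0 - y) * math.log(1.0 - y_pred))
```

A confident correct prediction gives a loss near zero, while a maximally uncertain prediction of 0.5 gives log 2; the claim's "improved" loss reweights exactly these easy and hard cases.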
3. The method of claim 1, wherein the SE-Block module calibrates the feature relationships in the network by compression and excitation processes to increase the effective weight and decrease the ineffective or less effective weight.
4. The method of image object detection combining a lightweight attention mechanism and a YOLOv network according to claim 1, wherein the depthwise separable convolution corresponds to the conv2d operation in the operator stage.
5. The method of image object detection combining a lightweight attention mechanism and a YOLOv network according to claim 1, wherein the residual module of the attention mechanism corresponds to the bneck operation in the operator stage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111141568.1A CN113869412B (en) | 2021-09-28 | 2021-09-28 | Image target detection method combining lightweight attention mechanism and YOLOv network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113869412A CN113869412A (en) | 2021-12-31 |
CN113869412B true CN113869412B (en) | 2024-06-07 |
Family
ID=78991824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111141568.1A Active CN113869412B (en) | 2021-09-28 | 2021-09-28 | Image target detection method combining lightweight attention mechanism and YOLOv network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113869412B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115063478B (en) * | 2022-05-30 | 2024-07-12 | 华南农业大学 | Fruit positioning method, system, equipment and medium based on RGB-D camera and visual positioning |
CN117809294B (en) * | 2023-12-29 | 2024-07-19 | 天津大学 | Text detection method based on feature correction and difference guiding attention |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101434A (en) * | 2020-09-04 | 2020-12-18 | 河南大学 | Infrared image weak and small target detection method based on improved YOLO v3 |
CN112232214A (en) * | 2020-10-16 | 2021-01-15 | 天津大学 | Real-time target detection method based on depth feature fusion and attention mechanism |
CN112396002A (en) * | 2020-11-20 | 2021-02-23 | 重庆邮电大学 | Lightweight remote sensing target detection method based on SE-YOLOv3 |
WO2021139069A1 (en) * | 2020-01-09 | 2021-07-15 | 南京信息工程大学 | General target detection method for adaptive attention guidance mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |