CN115330729A - Multi-scale feature attention-fused light-weight strip steel surface defect detection method - Google Patents


Info

Publication number
CN115330729A
CN115330729A
Authority
CN
China
Prior art keywords
strip steel
attention
feature fusion
module
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210982203.XA
Other languages
Chinese (zh)
Inventor
周锋
吴瑞琦
李楠
郭乃宏
王如刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Institute of Technology
Original Assignee
Yancheng Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Institute of Technology filed Critical Yancheng Institute of Technology
Priority to CN202210982203.XA priority Critical patent/CN115330729A/en
Publication of CN115330729A publication Critical patent/CN115330729A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/0004 — Image analysis; inspection of images, e.g. flaw detection; industrial image inspection
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/806 — Image or video recognition or understanding; fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G06T 2207/20081 — Special algorithmic details; training; learning
    • G06T 2207/20084 — Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/30136 — Industrial image inspection; metal


Abstract

The invention discloses a light-weight strip steel surface defect detection method with multi-scale feature fusion attention. The method comprises constructing a light-weight YOLOX detection model, introducing a multi-scale feature fusion attention module, iteratively training the detection model on a data set, and, after convergence, selecting the model with the highest precision as the strip steel surface defect detection model. The backbone feature extraction network Darknet of YOLOX is replaced by CSPDarknet, and the K × K ordinary convolution kernels used for feature extraction are replaced by K × 1 and 1 × K depthwise separable convolution kernels, so that detection speed is increased without affecting precision. In addition, the multi-scale feature fusion attention module improves the detection precision of the model, in particular for small targets. The model with the highest precision among the converged models is selected as the final model, realizing high-speed, accurate and robust strip steel surface defect detection.

Description

Multi-scale feature attention-fused light-weight strip steel surface defect detection method
Technical Field
The invention relates to the technical field of computer vision, and in particular to a light-weight strip steel surface defect detection method with multi-scale feature fusion attention.
Background
Strip steel has a very wide range of applications, and industrial demand for it is large. Over the past decades, machine-vision-based Automatic Surface Inspection Systems (ASISs) have gained wide attention as a non-contact, non-destructive and fully automatic way to assist or replace manual inspection. To reduce labor cost and improve detection efficiency, deep-learning-based detection algorithms have replaced traditional approaches. Deep learning has been applied to many challenging computer vision tasks, such as urban traffic analysis, multi-target detection and medical image segmentation. Compared with hand-crafted approaches, the multilayer structure of a deep neural network extracts all features automatically and with stronger feature extraction performance. Deep-learning-based target detection algorithms fall into two types: two-stage and single-stage. A two-stage network works in two steps, generating proposal regions and then classifying them, and achieves high detection precision; common two-stage algorithms include R-CNN, Fast R-CNN, Faster R-CNN and SPPNet. A single-stage model performs classification and regression directly; it is fast but less precise, especially on overlapping and small targets. Common single-stage algorithms include YOLO and SSD.
As productivity increases, detection devices place higher demands on the algorithms. In a typical production process, the rolling speed of strip steel can exceed 20 m/s and the width can reach 1 m. Such high-speed real-time operation requires dedicated image processing equipment and software with short execution times, so acquired images must be compressed to simplify the information they contain. However, compressing the image information also affects defect detection accuracy. A two-stage network offers high detection precision but cannot keep up with a high-speed rolling line, while a single-stage network is fast but often imprecise, especially on small and stacked targets. A light-weight strip steel surface defect detection method with multi-scale feature fusion attention is therefore urgently needed to solve these problems.
Disclosure of Invention
The invention provides a multi-scale feature attention-fused light-weight strip steel surface defect detection method, which solves the problem that existing strip steel surface defect detection systems are either precise but slow or fast but imprecise.
In order to achieve the purpose, the invention provides the following technical scheme: a light-weight strip steel surface defect detection method with multi-scale feature fusion attention comprises the following steps:
s1, constructing a data set marked with surface defects of strip steel to be detected;
s2, replacing a YOLOX trunk feature extraction network Darknet with CSPDarknet, adjusting the repetition times of a residual error module, and constructing a lightweight detection model through depth separable convolution and parameters and calculated quantity of a low-rank decomposition compression residual error module;
s3, adding the multi-scale feature fusion attention module into the detection model;
and S4, performing iterative training on the detection model by using the data set, and selecting the detection model with the highest precision after convergence as the strip steel surface defect detection model.
Preferably, in step S1, steel surface defect detection dataset images are acquired; the average gray value over the whole dataset is extracted and used to gray-fill the image borders, padding the images from 200 × 200 to 416 × 416 pixels; and the labels in the dataset are adapted, with corresponding labels regenerated according to the filling operation.
Preferably, in step S2, the network comprises five CSP structures CSP1–CSP5, and the repetition times of the residual modules in the five CSP structures are adjusted: the repetition times of the residual blocks are adjusted from 1, 2, 8 and 4 to 1, 3 and 1; and the output channel numbers of four CSP modules are scaled to 0.375 times the original, namely 48, 96, 192 and 384.
Preferably, in step S2, in each residual module one convolution layer is used as a bottleneck to adjust the channel number, and the K × K ordinary convolution kernels used for feature extraction are replaced by K × 1 and 1 × K depthwise separable convolution kernels.
Preferably, in step S3, the multi-scale feature fusion attention module is added between the backbone feature extraction network and the Neck, so that the pre-training weights of the backbone feature extraction network can still be used.
Preferably, the adding of the multi-scale feature fusion attention module specifically comprises:
applying attention modules with receptive fields of different scales to the feature tensors of three resolutions output by the backbone feature extraction network, and performing feature fusion;
respectively transmitting the obtained fusion features into a space attention module and a channel attention module;
preferably, for the spatial attention module, spatial information is compressed by combining the pixel information at the same position across channels, one convolution layer is used to adjust the channel number, and after Sigmoid activation the result is feature-fused with the original input features;
preferably, for the channel attention module, global average pooling and global maximum pooling are performed on each feature map to compress channel information, a one-dimensional convolution layer is added before the fully connected layer for feature learning, with the convolution kernel size set to 7 to ensure cross-channel information interaction; the result is activated by Sigmoid and finally feature-fused with the input feature tensor;
and performing feature fusion on the outputs of the space attention module and the channel attention module to obtain the output of the whole attention module.
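The two parallel branches above can be sketched minimally in NumPy. This is a hedged illustration, not the patent's exact implementation: the descriptor combinations, the gating by element-wise multiplication, the per-pixel spatial weight `w` and the omission of the fully connected layer are assumptions where the text is not specific; learned weights are supplied by the caller.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w):
    """Spatial branch sketch: compress channel information at each pixel,
    scale by `w` (standing in for the learned channel-adjusting convolution),
    Sigmoid-gate, then fuse with the input by multiplication (assumption)."""
    desc = feat.mean(axis=-1, keepdims=True)     # (H, W, 1) per-pixel summary
    gate = sigmoid(desc * w)                     # (H, W, 1), values in (0, 1)
    return feat * gate

def channel_attention(feat, kernel):
    """Channel branch sketch: global average + max pooling per feature map,
    a 1-D convolution (kernel size 7) across the channel descriptor,
    Sigmoid gating, fusion with the input."""
    c = feat.shape[-1]
    desc = feat.mean(axis=(0, 1)) + feat.max(axis=(0, 1))   # (C,)
    pad = len(kernel) // 2
    padded = np.pad(desc, pad)
    conv = np.array([padded[i:i + len(kernel)] @ kernel for i in range(c)])
    gate = sigmoid(conv)                                    # (C,)
    return feat * gate                                      # broadcast over H, W

def parallel_attention(feat, w_spatial, k_channel):
    """Fuse the two parallel branches by element-wise addition (the fusion
    rule is an assumption; the text only says the outputs are feature-fused)."""
    return spatial_attention(feat, w_spatial) + channel_attention(feat, k_channel)

feat = np.random.rand(13, 13, 8)
out = parallel_attention(feat, 1.0, np.ones(7) / 7)
```

In a trained model, `w_spatial` and `k_channel` would be learned parameters; here fixed values are used only to show the data flow and shapes.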
Preferably, for the feature tensor of CSP3, whose resolution is the largest, receptive fields of two scales, 3 × 3 and 5 × 5, are applied;
for the feature tensor of CSP4, whose resolution is intermediate, receptive fields of three scales, 1 × 1, 3 × 3 and 5 × 5, are applied;
for the feature tensor of CSP5, whose resolution is the smallest, receptive fields of two scales, 1 × 1 and 3 × 3, are applied.
Preferably, in step S4, the learning rate is adjusted by a cosine annealing algorithm; after the loss value stops decreasing, training continues for several more iterations, and the detection model with the highest precision is selected as the strip steel surface defect detection model.
Compared with the prior art, the invention has the following beneficial effects: a light-weight design is adopted to optimize the backbone feature extraction network of YOLOX, and the K × K ordinary convolution kernels used for feature extraction are replaced by K × 1 and 1 × K depthwise separable convolution kernels, improving detection speed without affecting precision. In addition, adding the multi-scale feature fusion attention module improves the detection precision of the model, in particular for small targets. The model with the highest precision among the converged models is selected as the final model, realizing high-speed, accurate and robust strip steel surface defect detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
fig. 1 is an overall flowchart of a multi-scale feature attention-fused light-weight strip steel surface defect detection method according to an embodiment of the invention.
FIG. 2 is a diagram of a light-weight CSP structure according to an embodiment of the present invention;
FIG. 3 is a diagram of a modified YOLOX model of an embodiment of the invention;
FIG. 4 is a block diagram of an MFFAM according to an embodiment of the present invention;
FIG. 5 is a diagram of a multi-scale receptor field structure according to an embodiment of the present invention;
FIG. 6 is a populated dataset image according to an embodiment of the present invention;
FIG. 7 is a histogram of AP values for a lightweight design model;
FIG. 8 is a histogram comparing AP values of different attention modules.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example (b): as shown in fig. 1, a method for detecting surface defects of light-weight strip steel with multi-scale features fused with attention comprises the following steps:
s1, constructing a data set marked with surface defects of the strip steel to be detected; acquiring a steel surface defect detection data set image; extracting an average gray value of a full graph of the data set, carrying out gray filling on the periphery of the image, and filling pixel values from 200 × 200 to 416 × 416; adapting the tags in the data set, and generating corresponding tags according to filling operation;
s2, replacing a YOLOX trunk feature extraction network Darknet with a CSP domain Partial (Cross Stage Partial) structure which comprises five CSP structures including a CSP1, a CSP2, a CSP3, a CSP4 and a CSP5, adjusting the repetition times of residual error modules in the five CSP structures, and constructing a lightweight detection model through parameters and calculated quantities of deep separable convolution and low-rank decomposition compression residual error modules;
wherein CSPDarknet obtains better learning capability through the stacking of CSP structures and residual blocks; the cross-stage partial network maximizes the difference between gradient combinations, preventing different convolutional layers from learning the same gradient information; a repeatedly invoked residual structure exists in each CSP structure, and the parameters of the model are mainly concentrated in this part;
referring to fig. 2, in order to reduce the number of parameters and the amount of computation without affecting feature extraction performance, in the residual part of the module one convolution layer is used as a bottleneck to adjust the channel number, and the K × K ordinary convolution kernels used for feature extraction are replaced by K × 1 and 1 × K depthwise separable convolution kernels, further increasing the operation speed of the model; the output is produced after feature fusion;
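To see why the K × 1 / 1 × K depthwise factorization shrinks the residual modules, the parameter counts can be compared directly. This is a back-of-the-envelope sketch; the channel width 96 is just an illustrative value taken from the scaled CSP output channels, and the 1 × 1 bottleneck convolutions that restore cross-channel mixing are deliberately not counted.

```python
def ordinary_conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def factorized_depthwise_params(k, c):
    """Parameters of a k x 1 plus 1 x k depthwise pair: one k-tap filter
    per channel in each of the two layers (bias ignored)."""
    return k * c + k * c

k, c = 3, 96
full = ordinary_conv_params(k, c, c)        # 3*3*96*96 = 82944
light = factorized_depthwise_params(k, c)   # 2*3*96   = 576
ratio = light / full
```

Under these assumptions the factorized depthwise pair uses well under 1% of the parameters of the ordinary convolution, which is consistent in direction with the compression the description reports, though the whole-model savings are naturally smaller once the bottleneck convolutions are included.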
the repetition times of the residual modules in the five CSP structures are adjusted: the repetition times of the residual blocks are adjusted from 1, 2, 8 and 4 to 1, 3 and 1; the output channel numbers of four CSP modules are scaled to 0.375 times the original, namely 48, 96, 192 and 384; the CBS modules with stride 2 perform down-sampling in the network; FIG. 3 shows the modified YOLOX model;
s3, according to the detection result of the unoptimized model, the detection effect of the Crazing cracks (Crazing) in all target types is the least ideal, and the AP value is very low; the relative area of the target is small, and the target belongs to a small target; in order to improve the detection precision of the model on the small target, a multi-scale feature fusion attention module is added into a detection module;
the multi-scale feature fusion attention module is added between the backbone feature extraction network and the Neck, so that the pre-training weights of the backbone feature extraction network can still be used and the model need not be trained from scratch. Applying attention between the backbone network and the Neck means applying attention modules with receptive fields of different scales to the feature tensors of three resolutions (feature1, feature2 and feature3) output by the backbone feature extraction network and performing feature fusion; since the feature maps are numerous and each has a relatively small resolution, the obtained fusion features are respectively fed into a spatial attention module and a channel attention module; for the channel attention module, a one-dimensional convolution layer is added before the fully connected layer; and the outputs of the spatial attention module and the channel attention module are feature-fused to obtain the output of the whole attention module;
referring to fig. 4, the structure of the MFFAM (Multi-scale Feature Fusion Attention Module) comprises 3 sub-modules, where sub-module 1 is a multi-scale receptive field structure and sub-modules 2 and 3 are two parallel attention branches;
referring to fig. 5, for the multi-scale receptive field structure, feature1 (52 × 52), feature2 (26 × 26) and feature3 (13 × 13) are all located in deep layers of the model, where image information has been highly extracted and compressed and the features are quite abstract, so the characterization of the image needs to be enhanced; sub-module 1 realizes perception at different scales through 3 receptive fields of different sizes, and finally performs feature fusion to obtain the output features.
If a small receptive field is applied to a high-resolution feature map, only local information is passed into the attention module, reducing the detection precision for small targets. If an oversized receptive field is applied to a low-resolution feature map, information from other targets is passed into the attention module, increasing the training difficulty and the time required for convergence. CSP-3 outputs feature1, CSP-4 outputs feature2 and CSP-5 outputs feature3. Therefore, receptive fields of 3 × 3 and 5 × 5 are adopted for feature1 with the highest resolution, receptive fields of 1 × 1, 3 × 3 and 5 × 5 for the intermediate feature2, and receptive fields of 1 × 1 and 3 × 3 for feature3.
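The per-level receptive-field assignment can be captured in a small configuration, with mean filters standing in for the learned convolution branches. This is a sketch: the fusion-by-averaging rule and the box-filter stand-in are assumptions, since the patent only says the branch outputs are feature-fused.

```python
import numpy as np

RECEPTIVE_FIELDS = {          # kernel sizes per feature level, from the text
    "feature1": [3, 5],       # 52 x 52, from CSP-3 (highest resolution)
    "feature2": [1, 3, 5],    # 26 x 26, from CSP-4
    "feature3": [1, 3],       # 13 x 13, from CSP-5 (lowest resolution)
}

def box_filter(x, k):
    """'Same'-padded k x k mean filter, standing in for a k x k conv branch."""
    if k == 1:
        return x.astype(float)
    pad = k // 2
    p = np.pad(x, pad, mode="edge")
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

def multi_scale_fuse(x, level):
    """Run every receptive-field branch for this level and average them."""
    branches = [box_filter(x, k) for k in RECEPTIVE_FIELDS[level]]
    return np.mean(branches, axis=0)

fmap = np.random.rand(13, 13)
fused = multi_scale_fuse(fmap, "feature3")
```

The configuration dictionary is the substantive part: it encodes the rule that larger receptive fields accompany higher-resolution feature maps and smaller ones the 13 × 13 map, exactly as argued above.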
And S4, performing iterative training of the detection model with the dataset, adjusting the learning rate by a cosine annealing algorithm, continuing training for several iterations after the loss value stops decreasing, and after convergence selecting the detection model with the highest precision as the strip steel surface defect detection model.
One embodiment is as follows: the NEU surface defect database, collected and organized by Song Kechen's team at Northeastern University, is adopted; it contains six typical types of strip steel surface defects: Crazing (Cr), Inclusions (In), Patches (Pa), Pitted Surface (PS), Rolled-in Scale (RS) and Scratches (Sc); each image is cropped from the captured pictures, with an original resolution of 200 × 200 pixels, for a total of 1800 pictures; due to the shooting environment and lighting, the gray levels of the pictures in the dataset differ considerably; moreover, in practical application, the pictures fed into the model by the acquisition equipment have higher resolution;
first, the pictures are processed: the average gray value over the whole dataset is extracted and the image borders are gray-filled, padding from 200 × 200 to 416 × 416 pixels; the average area ratio of the targets is thereby reduced to 1/4 of that in the original dataset, and a filled dataset image is shown in fig. 6; some pictures of the dataset are randomly flipped and rotated and corresponding labels generated to expand the dataset; finally, the dataset is divided into training, validation and test sets in a 6:2:2 ratio;
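The 6:2:2 partition can be sketched with the standard library; the function name, integer-ratio arithmetic and fixed shuffle seed are illustrative assumptions, not from the patent.

```python
import random

def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle a list of samples and split it into train/val/test sets
    according to integer ratios (6:2:2 here)."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    total = sum(ratios)
    n = len(items)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1800))
```

For the 1800-image NEU dataset this yields 1080 training, 360 validation and 360 test samples; the split would be applied after the flip/rotate augmentation expands the dataset.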
the method is characterized in that a GPU is Tesla V100S, the GPU used for testing is NVIDIA RTX 20606G video memory, based on Tensorflow-GPU 2.3.0 version, the CPU is Intel Core [email protected], the memory adopts DDR42667MHz 16G +16G double channels, the software adopts Pycharm, the development environment is Python3.7, and in order to analyze the influence of the proposed improvement on the model performance, two groups of experiments are designed and respectively used for analyzing the influence of the proposed improvement on the model performanceAnalyzing and comparing the lightweight design and the attention module, wherein each group of experiments adopt the same data set, training hyper-parameters and pre-training weights; the final integral model combines the attention applying mode with the best improvement effect with the light-weight main feature extraction network; during training, the iteration number is set to be 500, the batch size is set to be 64, the initial learning rate is set to be 0.01, and the learning rate is automatically adjusted through a cosine annealing algorithm in the training process; the cosine annealing algorithm is as follows:
lr = lr_min + (1/2) × (lr_max − lr_min) × (1 + cos(π · E_i / E))
where lr_max is the maximum learning rate, set to the initial learning rate; lr_min is the minimum learning rate, set to 0.0001; E denotes the total number of iterations and E_i the current iteration.
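Because the original equation is an unrecoverable image, the sketch below uses the standard cosine annealing rule implied by the surrounding parameter definitions (an assumption), with the embodiment's values lr_max = 0.01, lr_min = 0.0001 and E = 500.

```python
import math

def cosine_annealing_lr(e_i, e_total, lr_max=0.01, lr_min=0.0001):
    """Learning rate at iteration e_i of e_total, annealed from lr_max
    down to lr_min along a half cosine."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * e_i / e_total))

# a few sample points of the 500-iteration schedule
schedule = [cosine_annealing_lr(e, 500) for e in range(0, 501, 100)]
```

The schedule starts at exactly lr_max, decays slowly at first, fastest mid-training, and ends at lr_min, which is the usual motivation for cosine annealing over step decay.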
Recall R, precision P, mean average precision mAP and the number of pictures processed per second FPS (Frames Per Second) are taken as evaluation indexes; an IoU threshold of 0.5 is selected, and a target is considered successfully detected when IoU > 0.5; the recall and precision formulas are:
R = TP / (TP + FN)

P = TP / (TP + FP)
where mAP adopts the post-VOC-2010 calculation formula:
mAP = (AP_1 + AP_2 + … + AP_n) / n
wherein n is 6;
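The evaluation indexes can be computed as follows; the per-class AP values in the example are illustrative, not results from the patent.

```python
def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)

def mean_average_precision(ap_values):
    """mAP = arithmetic mean of per-class AP values (n = 6 defect classes)."""
    return sum(ap_values) / len(ap_values)

aps = [0.62, 0.81, 0.90, 0.77, 0.70, 0.88]   # illustrative per-class APs
m = mean_average_precision(aps)
```

In a full evaluation each class's AP would itself come from integrating the precision-recall curve at IoU > 0.5; only the final averaging step is shown here.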
With all hyper-parameters identical, the compression effect of the improved network model is shown in the following table:
[Table: compression results of the improved network model; the original table images are not recoverable from the source.]
In the table, DWC denotes replacing the ordinary convolution with a depthwise separable convolution (DepthwiseConv2D) only in the residual module; LR denotes replacing the ordinary convolution kernel only with a low-rank factorization in the residual module; L-CSP denotes the improved lightweight CSP structure. After the L-CSP module is introduced, the model volume is compressed well: the computation is reduced by 36.16%, the number of 416 × 416 pictures processed per second increases by 10.2, and the mAP drops by only 0.1%. FIG. 7 shows a histogram of AP values for the lightweight design models. It can be seen that after modules with lower capacity are adopted, the AP of some defect targets improves, indicating overfitting during training; Dropout is therefore introduced in subsequent training to improve model robustness. After L-CSP is adopted, the AP value of each defect target fluctuates only slightly; the lightweight modules improve the detection speed of the model while maintaining the detection effect. Taken together, these data show that the lightweight design of the backbone network is very effective.
The optimization achieved by adding different attention mechanisms is shown in the following table:
[Table: detection results with different attention mechanisms; the original table image is not recoverable from the source.]
In the table, the first row Ori denotes the network model without attention, for comparison. feature1/feature2/feature3 denote the 52 × 52, 26 × 26 and 13 × 13 feature tensors fed into the Neck. The applied attention types are: 1. SENet; 2. Channel Attention; 3. Spatial Attention; 4. CBAM; 5. CBAM^; 6. ECANet; 7. MFFAM; 8. MFFAM^. CBAM^ denotes a variant of the traditional CBAM in which the spatial and channel attention modules are applied in parallel and a one-dimensional convolution layer is added before the full connection; MFFAM denotes applying MFFAM with the same three scales to all three feature tensors; MFFAM^ denotes applying MFFAM with different scales according to the resolution of the feature tensor.
In this embodiment, applying attention at feature1/feature2/feature3 allows convergence within 70–110 iterations, saving about 17% of training time on average. MFFAM and MFFAM^ clearly improve detection of the small target Crazing, with mAP improved by 5.15% over the reference model, while the increases in model size and computation remain within an acceptable range, at 1.51% and 0.5% respectively. FIG. 8 compares the AP values of the different attention modules.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for elements thereof. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A light-weight strip steel surface defect detection method with multi-scale feature fusion attention is characterized by comprising the following steps:
s1, constructing a data set marked with surface defects of the strip steel to be detected;
S2, replacing the YOLOX backbone feature extraction network Darknet with CSPDarknet, adjusting the repetition times of the residual modules, and constructing a lightweight detection model by compressing the parameters and computation of the residual modules through depthwise separable convolution and low-rank decomposition;
s3, adding the multi-scale feature fusion attention module into the detection model;
and S4, performing iterative training on the detection model by using the data set, and selecting the detection model with the highest precision after convergence as the strip steel surface defect detection model.
2. The method for detecting the surface defects of the light-weight strip steel with the multi-scale feature fusion attention according to claim 1, characterized in that: in step S1, steel surface defect detection dataset images are acquired; the average gray value over the whole dataset is extracted and used to gray-fill the image borders, padding the images from 200 × 200 to 416 × 416 pixels; and the labels in the dataset are adapted, with corresponding labels regenerated according to the filling operation.
3. The method for detecting the surface defects of the light-weight strip steel with the multi-scale feature fusion attention according to claim 1, characterized in that: in step S2, the network comprises five CSP structures CSP1–CSP5, and the repetition times of the residual modules in the five CSP structures are adjusted: the repetition times of the residual blocks are adjusted from 1, 2, 8 and 4 to 1, 3 and 1; and the output channel numbers of four CSP modules are scaled to 0.375 times the original, namely 48, 96, 192 and 384.
4. The method for detecting the surface defects of the multi-scale feature fusion attention light-weight strip steel as claimed in claim 1, characterized in that: in step S2, in each residual module one convolution layer is used as a bottleneck to adjust the channel number, and the K × K ordinary convolution kernels used for feature extraction are replaced by K × 1 and 1 × K depthwise separable convolution kernels.
5. The method for detecting the surface defects of the multi-scale feature fusion attention light-weight strip steel as claimed in claim 1, characterized in that: in step S3, the multi-scale feature fusion attention module is added between the backbone feature extraction network and the Neck, so that the pre-training weights of the backbone feature extraction network can still be used.
6. The method for detecting the surface defects of the multi-scale feature fusion attention light-weight strip steel as claimed in claim 5, wherein the method comprises the following steps: the adding of the multi-scale feature fusion attention module specifically comprises the following steps:
applying attention modules with receptive fields of different scales to the feature tensors of three resolutions output by the backbone feature extraction network, and performing feature fusion;
respectively transmitting the obtained fusion features into a space attention module and a channel attention module;
and performing feature fusion on the outputs of the space attention module and the channel attention module to obtain the output of the whole attention module.
7. The multi-scale feature fusion attention light-weight strip steel surface defect detection method according to claim 6, characterized in that: for the channel attention module, a one-dimensional convolution layer is added before the fully connected layer.
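Claims 6 and 7 together can be sketched as a dual-branch attention module. Several details are assumptions the claims leave open: global average pooling before the 1-D convolution, sigmoid gating, the spatial branch's mean/max channel pooling, and fusion of the two branches by element-wise addition.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention with a 1-D convolution inserted before the fully
    connected layer (claim 7); pooling and gating details are assumptions."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2)
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                            # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                       # global average pool -> (B, C)
        w = self.conv1d(w.unsqueeze(1)).squeeze(1)   # 1-D conv across channels
        w = torch.sigmoid(self.fc(w))                # channel weights in (0, 1)
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """Spatial attention over pooled channel statistics (details assumed)."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(s))

class DualAttention(nn.Module):
    """Feed the fused features into both branches and fuse their outputs;
    fusion by addition is an assumption (claim 6 only says 'feature fusion')."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.ca(x) + self.sa(x)
```

Running the two branches in parallel (rather than sequentially, as in CBAM) lets their outputs be fused symmetrically, matching the wording of claim 6.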
8. The multi-scale feature fusion attention light-weight strip steel surface defect detection method according to claim 3, characterized in that: for the feature tensor of CSP3, whose resolution is the largest, receptive fields of two scales, 3 × 3 and 5 × 5, are applied;
for the feature tensor of CSP4, whose resolution is intermediate, receptive fields of three scales, 1 × 1, 3 × 3 and 5 × 5, are applied;
for the feature tensor of CSP5, whose resolution is the smallest, receptive fields of two scales, 1 × 1 and 3 × 3, are applied.
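The per-resolution receptive fields of claim 8 can be sketched as parallel convolution branches. Fusing the branches by summation is an assumption, as are the channel counts, which follow the 0.375-scaled values listed in claim 3.

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Apply parallel convolutions with the claimed receptive-field sizes
    and fuse the branch outputs (summation fusion is an assumption)."""
    def __init__(self, channels: int, kernel_sizes):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

# Receptive fields per claim 8; channel widths assumed from claim 3's scaling
branch_csp3 = MultiScaleBranch(96,  (3, 5))      # largest resolution
branch_csp4 = MultiScaleBranch(192, (1, 3, 5))   # intermediate resolution
branch_csp5 = MultiScaleBranch(384, (1, 3))      # smallest resolution
```

The intuition is that the highest-resolution map already has a small effective receptive field, so it benefits from larger kernels, while the coarsest map needs only small kernels to cover defect-scale context.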
9. The multi-scale feature fusion attention light-weight strip steel surface defect detection method according to claim 1, characterized in that: in step S4, the learning rate is adjusted by a cosine annealing algorithm; after the loss value stops decreasing, training continues for several more iterations, and the detection model with the highest precision is selected as the strip steel surface defect detection model.
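The cosine annealing schedule of claim 9 maps directly onto PyTorch's built-in scheduler. In this sketch the stand-in model, optimizer choice, initial learning rate, `T_max`, and `eta_min` are all illustrative assumptions; the claim specifies only the cosine annealing itself.

```python
import torch

# Stand-in model; the claim's actual network is the lightweight detector above.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# Learning rate follows a cosine curve from 0.01 down to eta_min over T_max epochs.
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... forward pass, loss computation, loss.backward() would go here ...
    opt.step()
    sched.step()
```

Continuing for extra iterations after the loss plateaus, and keeping the checkpoint with the highest validation precision, guards against picking a model from a noisy late-training epoch.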
CN202210982203.XA 2022-08-16 2022-08-16 Multi-scale feature attention-fused light-weight strip steel surface defect detection method Pending CN115330729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210982203.XA CN115330729A (en) 2022-08-16 2022-08-16 Multi-scale feature attention-fused light-weight strip steel surface defect detection method


Publications (1)

Publication Number Publication Date
CN115330729A true CN115330729A (en) 2022-11-11

Family

ID=83924193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210982203.XA Pending CN115330729A (en) 2022-08-16 2022-08-16 Multi-scale feature attention-fused light-weight strip steel surface defect detection method

Country Status (1)

Country Link
CN (1) CN115330729A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958086A (en) * 2023-07-21 2023-10-27 盐城工学院 Metal surface defect detection method and system with enhanced feature fusion capability
CN116958086B (en) * 2023-07-21 2024-04-19 盐城工学院 Metal surface defect detection method and system with enhanced feature fusion capability

Similar Documents

Publication Publication Date Title
CN110660052B (en) Hot-rolled strip steel surface defect detection method based on deep learning
CN111179229B (en) Industrial CT defect detection method based on deep learning
CN109840556B (en) Image classification and identification method based on twin network
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN112258470B (en) Intelligent industrial image critical compression rate analysis system and method based on defect detection
CN115830004A (en) Surface defect detection method, device, computer equipment and storage medium
CN115147418B (en) Compression training method and device for defect detection model
CN116416244A (en) Crack detection method and system based on deep learning
CN115049640B (en) Road crack detection method based on deep learning
CN112288700A (en) Rail defect detection method
CN115937736A (en) Small target detection method based on attention and context awareness
CN115330729A (en) Multi-scale feature attention-fused light-weight strip steel surface defect detection method
CN114648669A (en) Motor train unit fault detection method and system based on domain-adaptive binocular parallax calculation
CN116402769A (en) High-precision intelligent detection method for textile flaws considering size targets
CN116342536A (en) Aluminum strip surface defect detection method, system and equipment based on lightweight model
CN115631411A (en) Method for detecting damage of insulator in different environments based on STEN network
CN115035381A (en) Lightweight target detection network of SN-YOLOv5 and crop picking detection method
CN114882011A (en) Fabric flaw detection method based on improved Scaled-YOLOv4 model
CN114972780A (en) Lightweight target detection network based on improved YOLOv5
CN111598807B (en) Automobile part detection data sharing system and method based on block chain
CN112084941A (en) Target detection and identification method based on remote sensing image
CN115171079B (en) Vehicle detection method based on night scene
CN111339950A (en) Remote sensing image target detection method
CN116486166A (en) Power transmission line foreign matter identification detection method based on edge calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination