CN114663346A - Strip steel surface defect detection method based on improved YOLOv5 network

Strip steel surface defect detection method based on improved YOLOv5 network

Info

Publication number
CN114663346A
CN114663346A (application number CN202210113743.4A)
Authority
CN
China
Prior art keywords
defect
network model
module
training
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210113743.4A
Other languages
Chinese (zh)
Inventor
石肖松
刘坤
杨晓松
孟蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202210113743.4A priority Critical patent/CN114663346A/en
Publication of CN114663346A publication Critical patent/CN114663346A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/0004 — Industrial image inspection (G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL › G06T 7/00 Image analysis › G06T 7/0002 Inspection of images, e.g. flaw detection)
    • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting (G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 18/00 Pattern recognition › G06F 18/21 Design or setup of recognition systems or techniques)
    • G06F 18/23213 — Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering (G06F 18/23 Clustering techniques › G06F 18/2321 using statistics or function optimisation, e.g. modelling of probability density functions)
    • G06F 18/24 — Classification techniques (G06F 18/00 Pattern recognition › G06F 18/20 Analysing)
    • G06N 3/045 — Combinations of networks (G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/048 — Activation functions (G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08 — Learning methods (G06N 3/02 Neural networks)
    • G06T 2207/20081 — Training; Learning (G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/20 Special algorithmic details)
    • G06T 2207/20084 — Artificial neural networks [ANN] (G06T 2207/20 Special algorithmic details)
    • G06T 2207/30136 — Metal (G06T 2207/30 Subject of image; context of image processing › G06T 2207/30108 Industrial image inspection)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a strip steel surface defect detection method based on an improved YOLOv5 network. Built on the YOLOv5 network model, the method adds a self-designed channel-spatial attention module, which improves detection precision and addresses feature extraction in complex scenes and backgrounds. The method exploits the strength of deep learning in feature extraction: without relying on manual feature engineering, it learns simple shallow features from a large dataset and then progressively learns more complex and abstract deep features. It offers better performance, higher accuracy in identifying defect types, high precision and recall for strip steel defects, and fast recognition.

Description

Strip steel surface defect detection method based on improved YOLOv5 network
Technical Field
The invention belongs to the technical field of industrial defect detection, and particularly relates to a strip steel surface defect detection method using a YOLOv5 network with a channel-spatial attention module.
Background
Strip steel is one of the important steel raw materials; it is widely used in machinery manufacturing, aerospace and transportation and plays an important role in many areas of production and daily life. During strip steel production, however, limitations of industrial technology and the influence of the production process cause various surface defects such as oil spots, almond-shaped defects, white spots and scratches. These defects greatly affect the corrosion resistance and service life of the strip. Existing defect detection relies mainly on manual visual inspection, which suffers from low efficiency, high labor intensity and high production cost, and cannot meet the requirements of strip steel surface defect detection.
Deep learning automatically extracts and learns defect features through convolutional neural networks, without hand-designed feature factors; deep neural networks therefore offer strong learning ability and high robustness, and have gradually become the mainstream method for strip steel surface defect detection. Weng Yushang et al. (Weng Yushang, Xiao Jinqiu, Xia Yuang. Strip steel surface defect detection with an improved Mask R-CNN algorithm [J/OL]. Computer Engineering and Applications: 1-12 [2021-06-24]) proposed an improved mask region convolutional neural network (Mask R-CNN) algorithm, using a k-means II clustering algorithm to improve the anchor generation of the region proposal network (RPN). Li Weigang et al. (Li Weigang, Ye Xin, Zhao Yuntao, Wang Wenbo. Strip steel surface defect detection based on an improved YOLOv3 algorithm [J]. Acta Electronica Sinica, 2020) proposed a YOLOv3 framework that fuses shallow and deep features; the improved YOLOv3 algorithm reached a mean average precision of 80% on the Northeastern University strip steel dataset. However, when facing weak and tiny strip steel defects, where the background and foreground are highly coupled and the defect area is small, deep learning models cannot extract features well, and the detection performance suffers.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the invention addresses the technical problem of providing a strip steel surface defect detection method based on an improved YOLOv5 network that can detect and locate different types of strip steel surface defects in real time, improves the accuracy of identifying defects of different types and of similar structures, and meets the real-time and accuracy requirements of actual industrial strip steel production.
The technical scheme adopted by the invention for solving the technical problems is as follows: a strip steel surface defect detection method based on an improved YOLOv5 network is designed, and is characterized by comprising the following steps:
the first step is as follows: image dataset acquisition
1.1, acquiring a surface image of the strip steel by using an industrial camera, and screening out a picture containing a defect; when the defect type in the screened defect image covers the known type of the surface defect of the strip steel, a defect image set is formed;
1.2, carrying out size normalization operation on the defect picture set, and then manually labeling the pictures in the defect picture set by using Labelimg software to enable each defect picture to have a label of a defect type and a defect position coordinate;
1.3 randomly dividing not less than 60% of the marked defect picture set into a training set, and taking the rest as a verification set;
the second step is that: construction of improved YOLOv5 network model
The improved YOLOv5 network model is obtained, on the basis of the YOLOv5 network model, by connecting a CSA module in series between the three CSP23_ modules of the PAN and the three conv modules of the classification and positioning part of the network model;
the CSA module comprises a channel attention module and a space attention module, the two modules are connected in series, and the output of the channel attention module is the input of the space attention module;
The feature map F1 output by a CSP23_ module is input into the CSA module and first passes through the channel attention module. The channel attention module applies global maximum pooling and global average pooling over the width and height dimensions to the input feature F1, obtaining two C × 1 × 1 feature maps; the two C × 1 × 1 feature maps are each processed by a fast one-dimensional convolution with kernel size k, the results of the two fast one-dimensional convolutions are added, and sigmoid processing is applied to obtain the channel attention; the channel attention is multiplied by the original feature F1 to re-weight the features, giving the weighted feature F2;
The feature F2 output by the channel attention module is input to the spatial attention module, which applies global maximum pooling and global average pooling along the channel dimension to F2, obtaining two 1 × W × H feature maps; the two 1 × W × H feature maps are concatenated along the channel axis, a convolution with a 7 × 7 kernel reduces the result to a one-channel map of size 1 × W × H, and the activation function sigmoid generates the spatial attention weight; finally, the spatial attention weight is multiplied by the input feature F2 to obtain the output feature F3 of the spatial attention module; feature F3 is the output of the CSA module and the input of the conv module of the classification and positioning part of the network model;
the third step: training improved Yolov5 network model
3.1 image dataset preprocessing
Preprocessing the training set in a Mosaic data enhancement mode;
3.2 parameter settings
Initializing all weight values, bias values and batch normalization scale factor values, setting the initial learning rate and batch_size of the network, and inputting the initialized parameter data into the network; dynamically adjusting the learning rate and the number of iterations according to the change of the training loss so as to update the parameters of the whole network; the training is divided into two stages: the first stage is the first 100 epochs of training, with the initial learning rate fixed at 0.001 to accelerate convergence; the second stage covers the epochs after epoch 100, with the learning rate set to 0.0001;
3.3 network model training
Inputting the preprocessed training set into the improved YOLOv5 network model with the initialization parameters set in the second step for feature extraction; anchor boxes are automatically generated for the training-set images by K-means clustering, the anchor box sizes serve as prior boxes, and bounding boxes are obtained through box regression prediction; a logistic classifier then classifies the bounding boxes, yielding the defect-class probability for each bounding box; the defect-class probabilities of all bounding boxes are sorted by the non-maximum suppression method, and the defect class of each bounding box is determined to obtain the predicted value; the predicted value contains the defect class and defect position information, and the non-maximum suppression threshold is 0.5; the loss between the predicted value and the true value is then computed with the GIoU loss function; back-propagation is performed according to the training loss to update the parameters of the backbone network and the classification-regression network until the loss meets the preset value, completing the training of the network model parameters;
3.4 network model testing
Inputting the verification set into the network model which completes the parameter training in the step 3.3 to obtain a tensor prediction value of the verification set; comparing the predicted value of the tensor with the labeling information, and testing the reliability of the network model; evaluating the network model by using the AP, and testing the network model to be reliable when the AP is not less than 85%;
the fourth step: strip steel surface defect detection
The surface image of the strip steel to be detected is subjected to the same size normalization as in step 1.2 of the first step and is then input into the network model tested as reliable in the third step, yielding the defect tensor information of the strip steel surface image to be detected, including the defect position, defect type and confidence.
Compared with the prior art, the invention has the following beneficial effects: the detection method is based on the YOLOv5 network model with a self-designed channel-spatial attention module added. The channel-spatial attention module fuses shallow and deep features before performing the attention operation, so that the deep layers contain more high-level semantic information and less background information; with the target information reinforced by this shallow-deep fusion, the network pays more attention to target defects during the attention operation, suppresses background information, and better guides multi-scale fusion, improving detection precision and addressing feature extraction in complex scenes and backgrounds. The method exploits the strength of deep learning in feature extraction: without relying on manual feature engineering, it learns simple shallow features from a large dataset and then progressively learns more complex and abstract deep features. It offers better performance, higher accuracy in identifying defect types, high precision and recall for strip steel defects, and fast recognition.
Drawings
Fig. 1 is a schematic structural diagram of the CSA module according to an embodiment of the detection method of the present invention.
Fig. 2 is a schematic structural diagram of the improved YOLOv5 network model according to an embodiment of the detection method of the present invention.
Detailed Description
The technical solutions of the present application will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without making any creative effort belong to the protection scope of the present application.
The invention provides a strip steel surface defect detection method (a detection method for short, see figures 1-2) based on an improved YOLOv5 network, which comprises the following steps:
the first step is as follows: image dataset acquisition
1.1, acquiring a surface image of the strip steel by using an industrial camera, and screening out a picture containing a defect; when the defect type in the screened defect image covers the known type of the surface defect of the strip steel, a defect image set is formed;
1.2, performing size normalization operation on the defect picture set (scaling to 608 × 608 pixels in this embodiment), and then manually labeling the pictures in the defect picture set by using Labelimg software, so that each defect picture has a label of the defect type and the defect position coordinate;
1.3 randomly dividing not less than 60% of the marked defect picture set into a training set, and taking the rest as a verification set; this embodiment uses a 4:1 split, i.e. 80% for the training set and the remaining 20% for the validation set.
The second step is as follows: construction of the improved YOLOv5 network model
The improved YOLOv5 network model is based on the YOLOv5 network model, with a CSA (channel-spatial attention) module connected in series between each of the three CSP23_ modules (cross-stage partial networks, namely CSP23_5, CSP23_4 and CSP23_3) of the PAN (path aggregation network) and the three conv (convolution) modules of the classification and positioning part of the network model.
The CSA module includes a channel attention (Channel Attention) module and a spatial attention (Spatial Attention) module; the two modules are connected in series, and the output of the channel attention module is the input of the spatial attention module.
The feature map F1 (C × W × H) output by a CSP23_ module is input into the CSA module and is first processed by the channel attention module. The channel attention module applies global maximum pooling (MaxPool) and global average pooling (AvgPool) over the width and height dimensions to the input feature F1 (C × W × H), producing two C × 1 × 1 feature maps. Each of the two C × 1 × 1 feature maps is then processed by a fast one-dimensional convolution (Conv1d) with kernel size k; the results of the two fast one-dimensional convolutions are added and passed through the activation function sigmoid to obtain the channel attention. Multiplying the channel attention by the original feature F1 re-weights the features, giving the weighted feature F2.
The convolution kernel size k of the fast one-dimensional convolution represents the coverage of local cross-channel interaction, i.e. how many neighboring channels participate in the attention prediction for a given channel. The interaction coverage (i.e. kernel size k) is proportional to the channel dimension; the specific calculation formula is:
$$k = \left|\frac{\log_2 C}{\beta} + \frac{b}{\beta}\right|_{\text{odd}}$$
where C denotes the number of feature channels, β = 2 and b = 1 are two hyperparameters, and |·|_odd denotes taking the nearest odd integer.
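For illustration, a minimal Python sketch of this kernel-size rule; the function name is ours, and the nearest-odd rounding follows the ECA-style channel attention that this formula matches:

```python
import math

def eca_kernel_size(channels: int, beta: int = 2, b: int = 1) -> int:
    """Adaptive kernel size k = |log2(C)/beta + b/beta|_odd, with beta = 2
    and b = 1 as stated in the text; the result is forced to the nearest
    odd integer so the 1-D convolution has a symmetric neighborhood."""
    t = int(abs(math.log2(channels) / beta + b / beta))
    return t if t % 2 == 1 else t + 1

# e.g. eca_kernel_size(64) == 3, eca_kernel_size(256) == 5
```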
The feature F2 output by the channel attention module is input to the spatial attention module, which applies global maximum pooling (Max Pool) and global average pooling (Mean Pool) along the channel dimension to F2, obtaining two 1 × W × H feature maps. The two 1 × W × H feature maps are then concatenated (Concat) along the channel axis; a convolution with a 7 × 7 kernel reduces the concatenation result to a one-channel map of size 1 × W × H, and the activation function sigmoid generates the spatial attention weight. Finally, the spatial attention weight is multiplied by the input feature F2 to obtain the output feature F3 of the spatial attention module. Feature F3 is the output of the CSA module and the input of the conv module of the classification and positioning part of the network model.
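The following PyTorch sketch shows one plausible implementation of the CSA module as described (channel attention followed in series by spatial attention). Class and variable names are ours, and the two pooled branches sharing a single Conv1d is an assumption; the text does not say whether the two fast one-dimensional convolutions share weights:

```python
import math
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel branch: global max/avg pooling over H and W, a shared fast
    1-D convolution of adaptive kernel size k, sigmoid, then reweighting."""
    def __init__(self, channels: int, beta: int = 2, b: int = 1):
        super().__init__()
        t = int(abs(math.log2(channels) / beta + b / beta))
        k = t if t % 2 == 1 else t + 1             # nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (N, C, H, W) = F1
        max_p = torch.amax(x, dim=(2, 3))          # (N, C)
        avg_p = torch.mean(x, dim=(2, 3))          # (N, C)
        attn = self.conv(max_p.unsqueeze(1)) + self.conv(avg_p.unsqueeze(1))
        attn = self.sigmoid(attn).squeeze(1)       # (N, C) channel attention
        return x * attn[:, :, None, None]          # F2

class SpatialAttention(nn.Module):
    """Spatial branch: per-pixel max/mean over channels, concat, 7x7 conv
    down to one channel, sigmoid, then reweighting."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (N, C, H, W) = F2
        max_p = torch.amax(x, dim=1, keepdim=True) # (N, 1, H, W)
        avg_p = torch.mean(x, dim=1, keepdim=True) # (N, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([max_p, avg_p], dim=1)))
        return x * attn                            # F3

class CSA(nn.Module):
    """Channel attention followed in series by spatial attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

# e.g. CSA(256)(torch.randn(1, 256, 19, 19)).shape == (1, 256, 19, 19)
```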
The third step: training improved Yolov5 network model
3.1 image dataset preprocessing
Preprocessing the training set in a Mosaic data enhancement mode;
3.2 parameter settings
Initializing all weight values, bias values and batch normalization scale factor values, setting the initial learning rate and batch_size of the network, and inputting the initialized parameter data into the network; dynamically adjusting the learning rate and the number of iterations according to the change of the training loss so as to update the parameters of the whole network. The training is divided into two stages: the first stage is the first 100 epochs of training, with the initial learning rate fixed at 0.001 to accelerate convergence; the second stage covers the epochs after epoch 100, with the learning rate set to 0.0001.
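A minimal sketch of the two-stage schedule just described; the loss-driven dynamic adjustment is not modelled here, and the function name is ours:

```python
def learning_rate(epoch: int) -> float:
    """Two-stage schedule: 0.001 for the first 100 epochs to accelerate
    convergence, 0.0001 afterwards."""
    return 1e-3 if epoch < 100 else 1e-4

# hypothetical use with a PyTorch optimizer:
# for g in optimizer.param_groups:
#     g["lr"] = learning_rate(epoch)
```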
3.3 network model training
The preprocessed training set is input into the improved YOLOv5 network model with the initialization parameters set in the second step for feature extraction. Anchor boxes are automatically generated for the training-set images by K-means clustering, the anchor box sizes (which can be scaled proportionally with the image scaling) serve as prior boxes, and bounding boxes are obtained through box regression prediction. A logistic classifier then classifies the bounding boxes, yielding the defect-class probability for each bounding box. The defect-class probabilities of all bounding boxes are sorted by non-maximum suppression (NMS), and the defect class of each bounding box is determined to obtain the predicted value; the predicted value contains the defect class and defect position information, and the non-maximum suppression threshold is 0.5. The loss between the predicted value and the true value is then computed with the GIoU loss function; back-propagation is performed according to the training loss to update the parameters of the backbone network and the classification-regression network until the loss meets the preset value, completing the training of the network model parameters.
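A sketch of anchor generation by K-means clustering; the patent only states that K-means is used, so the 1 − IoU assignment below is a common choice for YOLO anchor generation rather than the confirmed one, and the function name is ours:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor sizes."""
    wh = np.asarray(wh, dtype=np.float64)              # (N, 2) box sizes
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every anchor, boxes aligned at origin
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0])
                 * np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = wh[:, None].prod(-1) + anchors[None, :].prod(-1) - inter
        assign = np.argmax(inter / union, axis=1)      # nearest anchor by IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]   # sorted by area
```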
3.4 network model testing
Inputting the verification set into the network model which completes parameter training in the step 3.3 to obtain a tensor prediction value of the verification set; comparing the predicted value of the tensor with the labeling information, and testing the reliability of the network model; and evaluating the network model by using the AP, wherein the network model is tested to be reliable when the AP is not less than 85%.
The fourth step: strip steel surface defect detection
The surface image of the strip steel to be detected is subjected to the same size normalization as in step 1.2 of the first step and is then input into the network model tested as reliable in the third step, yielding the defect tensor information of the image, including the defect position, the defect type and the confidence (the maximum of the defect-class probabilities).
On the input side, the YOLOv5 model enriches the image content through Mosaic data enhancement, improving detection of faint defects and small defect regions.
The Mosaic data enhancement has the following characteristics:
First, a batch of images is taken from the training sample set (Batch refers to a batch; batch_size, the batch size, is a model hyper-parameter, set to 32 in this embodiment). Next, 4 images are randomly selected from the batch, and each of the four images is randomly operated on by changing the color gamut, shrinking, flipping and/or cropping, with at least one operation applied per image. The 4 images are then arranged in the order top-left, bottom-left, top-right, bottom-right and stitched into a new image of the same size as the original, unoperated images, namely 608 × 608 × 3. The above operations are repeated, with the number of cycles set equal to the batch size.
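A simplified NumPy sketch of the Mosaic stitching described above; bounding-box label handling is omitted, and the random operations are reduced to a horizontal flip for brevity (the function name is ours):

```python
import random
import numpy as np

def mosaic(batch, size=608):
    """Pick 4 images from the batch, apply a random horizontal flip
    (standing in for the colour-gamut / shrink / flip / crop operations),
    and tile them top-left, bottom-left, top-right, bottom-right into one
    size x size x 3 image."""
    imgs = random.sample(batch, 4)              # batch: list of HxWx3 arrays
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]  # (y, x): TL, BL, TR, BR
    for img, (y, x) in zip(imgs, offsets):
        if random.random() < 0.5:
            img = img[:, ::-1]                  # random horizontal flip
        ys = np.arange(half) * img.shape[0] // half   # nearest-neighbour resize
        xs = np.arange(half) * img.shape[1] // half
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```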
The Focus module is followed by a structure consisting of three groups of CBL convolution blocks (Convolution - Batch Normalization - Leaky ReLU activation, CBL) and cross-stage partial network CSP1_X modules, then by a spatial pyramid pooling (SPP) module, together forming the feature extraction network. The Focus module performs a slicing operation on the image; after the Focus module, the image is fed into the CBL-plus-CSP1_X structure, and the SPP module then fuses low-level and high-level features.
The Focus module has the following characteristics:
the method comprises the steps of firstly marking an original image with the size of 608 multiplied by 3 by four numbers of 1, 2, 3 and 4, secondly combining pixels with the same number into 4 parts with the size of 304 multiplied by 3, then splicing the 4 parts into a feature map with the size of 304 multiplied by 12 in the depth direction according to the number size, and then connecting a CBL convolution structure.
The CBL convolution block contained in the Focus module has the following characteristics: the convolution (conv) uses 64 kernels of size 3 × 3 with a stride of 1.
The CSP1_ X module has the following features:
X denotes the number of residual structures; apart from the number of residual structures, the modules have the same architecture.
Taking the CSP1_3 module as an example: the input feature map first undergoes a CBL convolution operation; the result is fed into 3 residual structures, and the features passing through the residual structures are convolved; the resulting new feature map is concatenated (concat) in the depth direction with the feature map obtained by directly convolving the input feature map; finally, after batch normalization, the Leaky ReLU activation function and one CBL convolution block, the result is passed to the next module.
In the CSP1_X module, the conv layer of the CBL block that directly convolves the input feature map has the same dimensions as the conv layer in the last CBL block of the module: kernel size 1 × 1, stride 1.
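A PyTorch sketch of one plausible reading of the CSP1_X structure described above; the helper names and channel widths are assumptions, not the patent's confirmed implementation:

```python
import torch
import torch.nn as nn

def cbl(c_in, c_out, k=1, s=1):
    """CBL block: Convolution -> Batch Normalization -> Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True))

class Residual(nn.Module):
    """Residual unit: 1x1 CBL -> 3x3 CBL, added to the input."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(cbl(c, c, 1), cbl(c, c, 3))
    def forward(self, x):
        return x + self.block(x)

class CSP1_X(nn.Module):
    """Main branch: CBL -> X residual units -> 1x1 conv. Shortcut branch:
    1x1 conv on the raw input. Depth-wise concat, then BN, Leaky ReLU
    and a final CBL block, as in the description above."""
    def __init__(self, c_in, c_out, x=3):
        super().__init__()
        c = c_out // 2
        self.main = nn.Sequential(
            cbl(c_in, c, 1),
            *[Residual(c) for _ in range(x)],
            nn.Conv2d(c, c, 1, bias=False))
        self.shortcut = nn.Conv2d(c_in, c, 1, bias=False)
        self.post = nn.Sequential(
            nn.BatchNorm2d(2 * c),
            nn.LeakyReLU(0.1, inplace=True),
            cbl(2 * c, c_out, 1))
    def forward(self, x):
        return self.post(torch.cat([self.main(x), self.shortcut(x)], dim=1))
```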
Design of the feature re-processing network: it adopts a feature pyramid network (FPN) plus path aggregation network (PAN) structure, where the FPN is formed by two groups of CSP2_3 module, CBL convolution block and up-sampling (upsample) structures connected in series; the PAN contains two CBL convolution blocks, which down-sample the data.
The output of each up-sampling structure in the FPN is tensor-concatenated (concat) in the depth direction with the output feature maps of the CSP1_9-1 and CSP1_9-2 modules in the feature extraction network; meanwhile, the output of each CBL convolution block in the FPN is concatenated in the depth direction with the feature map of corresponding size from the CBL convolution blocks in the PAN. After each CBL convolution block of the PAN structure, a CSP2_3 module and an SPP module are added, and a CSP2_3 module and an SPP module are also added before the first CBL convolution block of the PAN. Neither the CSP2_3 module nor the SPP module changes the feature map size: the output of the CSP2_3-3 module in the Neck is 76 × 76 and is then passed through a CBL convolution block that changes the size to 38 × 38, and the output of the CSP2_3-4 module feeds the CBL convolution block that changes the size to 19 × 19.
The FPN also contains two CBL convolution blocks, but these use stride-1 convolutions and do not affect the feature map size; in the FPN it is the up-sampling (upsample) that changes the feature map size, allowing multi-scale target information to be learned.
The cross-stage partial network CSP2_3 module has the following characteristics:
the structure of the module is the same as the structure of each residual structure of the CSP1_3 module after the add fusion process is deleted. The CSP2_3 module comprises an initial CBL convolution structure, a tail CBL convolution structure and a plurality of repeating units, wherein 1 x 1 sized convolution layers are connected with 3 x 3 convolution layers through batch normalization and activation functions Leaky relu, the 3 x 3 convolution layers are connected with batch normalization and activation functions Leaky relu to form repeating units, the number of the repeating units is three, the input of the first repeating unit is connected with the output of the initial CBL convolution structure, the output of the three repeating units after being sequentially connected in series is connected with one layer of convolution layer, the output of the convolution layer and the original input of the original convolution layer are spliced after one layer of convolution layer, and the tail CBL convolution structure is connected through batch normalization and activation functions Leaky relu after splicing.
The CBL convolution blocks in the FPN have the following characteristics: the convolution kernels are all of size 1 × 1 and the strides are all 1.
The CBL convolution blocks in the PAN have the following characteristics: the convolution kernels are all of size 3 × 3 and the strides are all 2.
The SPP module consists of four parallel max-pooling layers with kernel sizes 1 × 1, 5 × 5, 9 × 9 and 13 × 13; the SPP module itself also contains two CBL convolution blocks, one at the beginning and one at the end.
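A PyTorch sketch of the SPP module as described; the channel widths are assumptions, and the beginning/end CBL blocks are simplified to plain 1 × 1 convolutions:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Parallel max-pooling with kernels 1/5/9/13 (stride 1, padded so the
    spatial size is unchanged; the 1x1 pool is the identity branch),
    concatenated along channels between two 1x1 convolutions."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_mid = c_in // 2
        self.pre = nn.Conv2d(c_in, c_mid, kernel_size=1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13))
        self.post = nn.Conv2d(c_mid * 4, c_out, kernel_size=1)

    def forward(self, x):
        x = self.pre(x)
        return self.post(torch.cat([x] + [p(x) for p in self.pools], dim=1))

# e.g. SPP(512, 512)(torch.randn(1, 512, 19, 19)).shape == (1, 512, 19, 19)
```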
The output of the YOLOv5 model inherits the idea of YOLOv3 and performs detection on feature maps at 3 scales: 19 × 19, 38 × 38 and 76 × 76. Behind each added SPP module, a corresponding number of anchor boxes is assigned to each scale, generating anchor boxes for each cell of the feature map (9 anchor sizes in total); the optimal box is selected by weighted non-maximum suppression, and GIoU is returned to the network as the loss function for parameter training.
This embodiment is implemented on a CentOS 7.9.2 platform using Python programming. The computer used for training and testing the network model has the following configuration: Tesla V100 GPU, Intel Xeon(R) Gold 6271C CPU @ 2.6 GHz; the framework used is the PyTorch deep learning framework. The learning rate of the YOLOv5 model is set to λ = 0.01, and the number of training iterations is 500.
This embodiment uses AP (average precision) and mAP (mean average precision) for evaluation. In object detection, each category has an associated precision and recall. AP and mAP are the common evaluation indices of recent years: AP is the area under the precision-recall curve, where the P-R curve is drawn with precision on the y axis and recall on the x axis. Model quality is mainly judged by the size of the area under the curve. mAP is the average of the APs over multiple categories.
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of positive cases correctly classified as positive, FP is the number of negative cases incorrectly classified as positive, and FN is the number of positive cases incorrectly classified as negative.
The AP calculation method comprises the following steps:
$$AP = \int_0^1 p(r)\,dr$$
where p denotes precision and r denotes recall.
In practice, precision and recall do not form continuous curves but finitely many discrete values. The discrete form is therefore computed:
$$AP = \sum_{k=1}^{N} p(k)\,\Delta r(k)$$
where N is the total number of images in the dataset to be detected, p(k) is the precision of the model when k images have been identified, and Δr(k) is the change in recall from k-1 to k identified images.
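A direct Python implementation of the discrete AP sum above (the function name is ours):

```python
import numpy as np

def average_precision(precision, recall):
    """Discrete AP: sum over ranked detections of p(k) * delta_r(k),
    i.e. the area under the P-R curve."""
    precision = np.asarray(precision, dtype=np.float64)
    recall = np.asarray(recall, dtype=np.float64)
    dr = np.diff(recall, prepend=0.0)   # delta r(k) = r(k) - r(k-1)
    return float(np.sum(precision * dr))

# e.g. average_precision([1.0, 1.0, 2/3], [0.5, 1.0, 1.0]) == 1.0
```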
In object detection, whether the model correctly detects a target is usually judged by the overlap between the predicted box and the ground-truth box; this overlap is called IoU (Intersection over Union). The IoU threshold is generally set to 0.5: if the IoU computed for a prediction is greater than 0.5, the target is considered correctly detected.
The calculation formula is as follows:
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
where A ∩ B denotes the overlap area of the prediction box and the target box, and A ∪ B denotes the area of their union.
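A minimal Python implementation of this IoU computation for axis-aligned boxes (the function name and corner-coordinate convention are ours):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1/7, below the 0.5 threshold
```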
In this embodiment, experiments were carried out on 4 types of strip steel surface defect images: block defects, scratches, oil spots and white spots. The recognition accuracy for white spots is about 87%, the recognition rates for all other defects exceed 90%, and the recognition rates for the two structurally similar defect types, block defects and oil spots, are even higher.
Anything not described in this specification follows the prior art.

Claims (4)

1. A strip steel surface defect detection method based on an improved YOLOv5 network is characterized by comprising the following steps:
the first step is as follows: image dataset acquisition
1.1, acquiring a surface image of the strip steel by using an industrial camera, and screening out a picture containing a defect; when the defect type in the screened defect image covers the known type of the surface defect of the strip steel, a defect image set is formed;
1.2, carrying out size normalization operation on the defect picture set, and then manually labeling the pictures in the defect picture set by using Labelimg software to enable each defect picture to have a label of defect type and defect position coordinates;
1.3 randomly dividing not less than 60% of the marked defect picture set into a training set, and taking the rest as a verification set;
the second step: improved YOLOv5 network model
The improved YOLOv5 network model is characterized in that on the basis of a YOLOv5 network model, a CSA module is connected in series between three CSP23_ modules of a PAN and three conv modules of a classification and positioning part of the network model;
the CSA module comprises a channel attention module and a space attention module, wherein the two modules are connected in series, and the output of the channel attention module is the input of the space attention module;
the feature map F1 output by a CSP23_ module is input into the CSA module and is first processed by the channel attention module; the channel attention module applies global maximum pooling and global average pooling over the width and height dimensions to the input feature F1, obtaining two C × 1 × 1 feature maps; the two obtained C × 1 × 1 feature maps are each processed by a fast one-dimensional convolution with kernel size k, the results of the two fast one-dimensional convolutions are added, and sigmoid processing is applied to obtain the channel attention; the channel attention is multiplied by the original feature F1 to re-weight the features, giving the weighted feature F2;
the feature F2 output by the channel attention module is input to the spatial attention module; the spatial attention module applies global maximum pooling and global average pooling along the channel dimension to F2 to obtain two 1 × W × H feature maps; the two 1 × W × H feature maps are then concatenated along the channel axis, a convolution operation with a 7 × 7 kernel reduces the result to a one-channel map of size 1 × W × H, and the activation function sigmoid generates the spatial attention weight; finally, the spatial attention weight is multiplied by the input feature F2 to obtain the output feature F3 of the spatial attention module; feature F3 is the output of the CSA module and is also the input of the conv module of the classification and positioning part of the network model;
the third step: training improved Yolov5 network model
3.1 image dataset preprocessing
Preprocessing the training set in a Mosaic data enhancement mode;
3.2 parameter settings
initializing all weight values, bias values and batch normalization scale factor values, setting the initial learning rate and batch_size of the network, and inputting the initialized parameter data into the network; dynamically adjusting the learning rate and the number of iterations according to the change of the training loss so as to update the parameters of the whole network; the training is divided into two stages: the first stage is the first 100 epochs of training, with the initial learning rate fixed at 0.001 to accelerate convergence; the second stage covers the epochs after epoch 100, with the learning rate set to 0.0001;
3.3 network model training
inputting the preprocessed training set into the improved YOLOv5 network model with the initialization parameters set in the second step for feature extraction; anchor boxes are automatically generated for the training-set images by K-means clustering, the anchor box sizes serve as prior boxes, and bounding boxes are obtained through box regression prediction; a logistic classifier then classifies the bounding boxes, yielding the defect-class probability for each bounding box; the defect-class probabilities of all bounding boxes are sorted by the non-maximum suppression method, and the defect class of each bounding box is determined to obtain the predicted value; the predicted value contains the defect class and defect position information, and the non-maximum suppression threshold is 0.5; the loss between the predicted value and the true value is then computed with the GIoU loss function; back-propagation is performed according to the training loss to update the parameters of the backbone network and the classification-regression network until the loss meets the preset value, completing the training of the network model parameters;
3.4 network model testing
Inputting the verification set into the network model which completes parameter training in the step 3.3 to obtain a tensor prediction value of the verification set; comparing the predicted value of the tensor with the labeling information, and testing the reliability of the network model; evaluating the network model by using the AP, and testing the network model to be reliable when the AP is not less than 85%;
the fourth step: strip steel surface defect detection
the surface image of the strip steel to be detected is subjected to the same size normalization as in step 1.2 of the first step and is then input into the network model tested as reliable in the third step, yielding the defect tensor information of the strip steel surface image to be detected, including the defect position, the defect type and the confidence.
2. The strip steel surface defect detection method based on the improved YOLOv5 network as claimed in claim 1, wherein the convolution kernel size k of the fast one-dimensional convolution in the channel attention module represents the coverage of local cross-channel interaction, i.e. how many neighboring channels participate in the attention prediction for a given channel; the interaction coverage k is proportional to the channel dimension, and the specific calculation formula is:
$$k = \left|\frac{\log_2 C}{\beta} + \frac{b}{\beta}\right|_{\text{odd}}$$
where C denotes the number of feature channels, β = 2 and b = 1 are two hyperparameters, and |·|_odd denotes taking the nearest odd integer.
3. The method as claimed in claim 1, wherein in step 1.2 of the first step, the size normalization is performed to scale the image to 608 × 608 pixels.
4. The method for detecting the surface defects of the steel strip based on the improved YOLOv5 network as claimed in claim 1, wherein in step 1.3 of the first step, the training set is 80%, and the rest 20% is the verification set.
CN202210113743.4A 2022-01-30 2022-01-30 Strip steel surface defect detection method based on improved YOLOv5 network Pending CN114663346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210113743.4A CN114663346A (en) 2022-01-30 2022-01-30 Strip steel surface defect detection method based on improved YOLOv5 network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210113743.4A CN114663346A (en) 2022-01-30 2022-01-30 Strip steel surface defect detection method based on improved YOLOv5 network

Publications (1)

Publication Number Publication Date
CN114663346A true CN114663346A (en) 2022-06-24

Family

ID=82025734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210113743.4A Pending CN114663346A (en) 2022-01-30 2022-01-30 Strip steel surface defect detection method based on improved YOLOv5 network

Country Status (1)

Country Link
CN (1) CN114663346A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147375A (en) * 2022-07-04 2022-10-04 河海大学 Concrete surface defect characteristic detection method based on multi-scale attention
CN115205568A (en) * 2022-07-13 2022-10-18 昆明理工大学 Road traffic multi-factor detection method with multi-scale feature fusion
CN115205568B (en) * 2022-07-13 2024-04-19 昆明理工大学 Road traffic multi-element detection method based on multi-scale feature fusion
WO2024113541A1 (en) * 2022-11-30 2024-06-06 宁德时代新能源科技股份有限公司 Battery cell electrode sheet inspection method and apparatus, and electronic device
CN115861855A (en) * 2022-12-15 2023-03-28 福建亿山能源管理有限公司 Operation and maintenance monitoring method and system for photovoltaic power station
CN115861855B (en) * 2022-12-15 2023-10-24 福建亿山能源管理有限公司 Operation and maintenance monitoring method and system for photovoltaic power station
CN116342531B (en) * 2023-03-27 2024-01-19 中国十七冶集团有限公司 Device and method for detecting quality of welding seam of high-altitude steel structure of lightweight large-scale building
CN116342531A (en) * 2023-03-27 2023-06-27 中国十七冶集团有限公司 Light-weight large-scale building high-altitude steel structure weld defect identification model, weld quality detection device and method
CN116523902A (en) * 2023-06-21 2023-08-01 湖南盛鼎科技发展有限责任公司 Electronic powder coating uniformity detection method and device based on improved YOLOV5
CN116523902B (en) * 2023-06-21 2023-09-26 湖南盛鼎科技发展有限责任公司 Electronic powder coating uniformity detection method and device based on improved YOLOV5
CN116612124B (en) * 2023-07-21 2023-10-20 国网四川省电力公司电力科学研究院 Transmission line defect detection method based on double-branch serial mixed attention
CN116612124A (en) * 2023-07-21 2023-08-18 国网四川省电力公司电力科学研究院 Transmission line defect detection method based on double-branch serial mixed attention
CN116664558B (en) * 2023-07-28 2023-11-21 广东石油化工学院 Method, system and computer equipment for detecting surface defects of steel
CN116664558A (en) * 2023-07-28 2023-08-29 广东石油化工学院 Method, system and computer equipment for detecting surface defects of steel
CN117252899A (en) * 2023-09-26 2023-12-19 探维科技(苏州)有限公司 Target tracking method and device
CN117252899B (en) * 2023-09-26 2024-05-17 探维科技(苏州)有限公司 Target tracking method and device
CN117274263A (en) * 2023-11-22 2023-12-22 泸州通源电子科技有限公司 Display scar defect detection method
CN117274263B (en) * 2023-11-22 2024-01-26 泸州通源电子科技有限公司 Display scar defect detection method

Similar Documents

Publication Publication Date Title
CN114663346A (en) Strip steel surface defect detection method based on improved YOLOv5 network
CN111310862B (en) Image enhancement-based deep neural network license plate positioning method in complex environment
CN113160192B (en) Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background
CN107016413B (en) A kind of online stage division of tobacco leaf based on deep learning algorithm
CN112241699A (en) Object defect category identification method and device, computer equipment and storage medium
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN111815564B (en) Method and device for detecting silk ingots and silk ingot sorting system
CN112560675B (en) Bird visual target detection method combining YOLO and rotation-fusion strategy
CN112287941B (en) License plate recognition method based on automatic character region perception
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN112926652B (en) Fish fine granularity image recognition method based on deep learning
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN113469950A (en) Method for diagnosing abnormal heating defect of composite insulator based on deep learning
CN114359245A (en) Method for detecting surface defects of products in industrial scene
CN116342536A (en) Aluminum strip surface defect detection method, system and equipment based on lightweight model
CN115239672A (en) Defect detection method and device, equipment and storage medium
CN116740758A (en) Bird image recognition method and system for preventing misjudgment
CN116071294A (en) Optical fiber surface defect detection method and device
CN111598854A (en) Complex texture small defect segmentation method based on rich robust convolution characteristic model
CN111540203A (en) Method for adjusting green light passing time based on fast-RCNN
CN114549489A (en) Carved lipstick quality inspection-oriented instance segmentation defect detection method
CN113516652A (en) Battery surface defect and adhesive detection method, device, medium and electronic equipment
CN116740572A (en) Marine vessel target detection method and system based on improved YOLOX
CN112348762A (en) Single image rain removing method for generating confrontation network based on multi-scale fusion
CN113887455B (en) Face mask detection system and method based on improved FCOS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination