CN116205832A - Metal surface defect detection method based on STM R-CNN - Google Patents

Metal surface defect detection method based on STM R-CNN

Info

Publication number
CN116205832A
CN116205832A
Authority
CN
China
Prior art keywords
cnn
metal surface
feature
stage
defect detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111430299.0A
Other languages
Chinese (zh)
Inventor
王卫
张新凯
于波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Computing Technology of CAS
Original Assignee
Shenyang Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Computing Technology of CAS filed Critical Shenyang Institute of Computing Technology of CAS
Priority to CN202111430299.0A priority Critical patent/CN116205832A/en
Publication of CN116205832A publication Critical patent/CN116205832A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20092 Interactive image processing based on input by user
    • G06T 2207/20104 Interactive definition of region of interest [ROI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30108 Industrial image inspection
    • G06T 2207/30136 Metal
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications


Abstract

The invention provides a metal surface defect detection method based on STM R-CNN, which uses a Swin Transformer as the backbone feature extraction network and a Mix-FPN mixed feature pyramid as the feature extraction layer, designs a metal surface image detection algorithm within a cascaded region convolutional neural network framework, and applies the Transformer structure to the field of metal surface defect detection. First, a Swin Transformer replaces the conventional residual network structure as the backbone feature extraction network, strengthening the network's ability to extract deep semantic information hidden in the image. A Mix-FPN mixed feature pyramid network is then designed to fuse information from different feature layers through a feature pyramid, followed by a Multi-stage R-CNN cascade structure in which each stage focuses on detecting region proposals within a specific range via a different IoU threshold. Finally, soft non-maximum suppression (Soft-NMS) and FP16 mixed-precision training are used to optimize and improve model performance.

Description

Metal surface defect detection method based on STM R-CNN
Technical Field
The invention relates to application of a deep learning algorithm to traditional industrial metal defect detection, in particular to a metal surface defect detection method based on STM R-CNN.
Background
Industrial metal profiles are widely used in industrial and building engineering construction and are the main raw material of large-scale steel structures. As a basic industrial material, the production quality of metal profiles directly affects the safety, quality, and service life of engineering construction; in some high-precision industrial applications the requirements on material quality and appearance are extremely strict. However, because of limitations in the steel production process and environment, various defects are unavoidable. To meet the different standards that various industrial applications impose on steel, corresponding quality inspection has emerged; this work mainly addresses surface defect detection for hot-rolled strip steel. Industrial hot-rolled strip steel is coiled strip with a thickness of 0.1-2 cm and a width of 60-200 cm, widely used as a raw material in the manufacture of automobiles, ships, electrical equipment, and engineering construction. The integrity and flawlessness of the strip surface directly determine the service life and value of downstream products, so strip surface defect detection is an important link in hot-rolled strip production quality inspection.
Traditional steel surface inspection methods include manual inspection, magnetic particle testing, penetrant testing, eddy current testing, X-ray testing, ultrasonic testing, and machine vision inspection. Manual inspection is inefficient; ultrasonic testing is unsuitable for materials with complex surfaces; and traditional machine vision methods are difficult to implement, demand high equipment precision, have a high design threshold that is unfriendly to system designers, incur high maintenance costs, and are strongly affected by the environment.
With the rapid development of deep learning in recent years, the feature extraction capability of the convolution operator has greatly advanced image processing technology. As Moore's law progressed, the hardware limitations on digital image processing eased, and increasingly powerful GPUs applied to large-scale parallel computing removed the computing-power bottleneck that once constrained deep learning. This brought revolutionary progress to traditional digital image processing, and image-based object detection methods have been rapidly applied and developed.
Deep-learning-based object detection algorithms can currently be divided structurally into two-stage and one-stage algorithms, represented respectively by Faster R-CNN and by the YOLO series and SSD. These methods perform well in metal surface flaw detection, and their detection results can basically meet industrial application requirements; but as the technology iterates, more advanced methods are needed to further improve the quality and accuracy of flaw detection.
Disclosure of Invention
The prior art faces the following problems: defects on metal profile surfaces are numerous and similar in appearance, with large variation in shape and area; under small data sets the algorithms converge slowly and accurate detection is difficult to achieve. The invention provides a metal surface defect detection algorithm based on STM R-CNN to solve these problems.
The technical scheme adopted by the invention for achieving the purpose is as follows:
a metal surface defect detection method based on STM R-CNN comprises the following steps:
s1, acquiring metal surface image data, enhancing the data, classifying tags, and establishing a pairing data set with classified tags;
s2, establishing an STM R-CNN metal surface defect detection network comprising the following 4 network modules: the backbone feature extraction network module, which uses Transformer operation units to extract features of different dimensions from the input data; the Mix-FPN mixed feature extraction network module, which further mixes the feature maps of different dimensions to obtain enhanced features; the RPN network module, which iteratively trains on the enhanced features and outputs regions of interest and defect boundary prediction boxes; and the Multi-stage R-CNN multi-cascade detection network module, which iteratively trains on the regions of interest output by the RPN module in combination with soft non-maximum suppression, and outputs defect boundary prediction boxes and predicted classification labels stage by stage;
s3, acquiring metal surface image data in real time, inputting it into the established STM R-CNN metal surface defect detection network, automatically locating the defect boundary prediction box, and outputting the predicted classification label.
The label classification is manual classification of defects.
The backbone feature extraction network module extracts features of different dimensions from the input data as follows:
1) Dividing the H×W×3 original image into 4×4 image patches, flattening the patches into linear embeddings, and adding the pixel position of each patch within the image;
2) Feeding these embeddings through sequentially connected Transformer blocks to obtain 4 feature maps, stage1-stage4, of different dimensions.
The Mix-FPN mixed feature extraction network module applies a cross-layer cross data fusion method to the 4 feature maps stage1-stage4 and outputs 5 enhanced feature maps p1-p5.
The cross-layer cross data fusion method comprises the following steps:
a. adopting T4+T2, T3+T1, T4+T2+T3, and T3+T1+T4 feature fusion, followed by convolution operations, to output the feature information p1-p4;
b. performing a 3×3, stride = 2 convolution on stage4 to obtain the feature information p5;
where T1-T4 are obtained from the 4 feature maps stage1-stage4 output by the backbone feature extraction network module via 1×1 convolutions that transform the channels.
The T4+T2, T3+T1, T4+T2+T3, and T3+T1+T4 feature fusion comprises:
1) performing 4-fold nearest-neighbor up-sampling on T4, then an add operation with T2, to obtain a new fused map T2′;
2) performing 4-fold nearest-neighbor up-sampling on the new T2′, then an add operation with T1, to obtain a new fused map T1′;
3) performing 8-fold downsampling on the new T1′, then an add operation with T4, to obtain a new fused map T4′;
4) performing 2-fold downsampling on the new T2′, then an add operation with T3, to obtain a new fused map T3′;
5) applying a separate 3×3 convolution to each of the fused maps to obtain the final outputs p1-p4 at 4 scales.
The loss functions of the RPN network module and the Multi-stage R-CNN multi-cascade detection network module are a classification cross-entropy loss and a bounding-box regression loss.
The invention has the following beneficial effects and advantages:
1. the currently most advanced Transformer-based feature extraction backbone architecture is adopted as the basic feature extraction network, enhancing the extraction of long-range semantic information;
2. a Mix-FPN mixed feature pyramid network (Mixed Feature Pyramid Network, Mix-FPN) framework is designed; by mixing high- and low-level feature semantic information, the algorithm adapts better to large scale variation in detection targets;
3. a Multi-stage R-CNN multi-cascade structure is designed; a multi-threshold progressive-increase strategy across cascaded R-CNN detection stages improves detection accuracy;
4. soft non-maximum suppression and FP16 mixed-precision training achieve rapid convergence, improving detection accuracy and shortening training time.
Drawings
FIG. 1 is a flow chart of a metal surface defect detection algorithm of STM R-CNN of the present invention;
FIG. 2 is a schematic diagram of the Swin Transformer data processing of the present invention;
FIG. 3 is a diagram of the Swin Transformer backbone algorithm model employed in the present invention;
FIG. 4 is a diagram of a Mix-FPN hybrid feature pyramid extraction algorithm model of the present invention;
FIG. 5 is a block diagram of a regional suggestion network in accordance with the present invention;
FIG. 6 is a diagram of a Multi-stage R-CNN Multi-cascade detection network R-CNN algorithm according to the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention is given with reference to the appended drawings. In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The invention may be embodied in many other forms than described herein, and those skilled in the art can make similar modifications without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As shown in fig. 1, the STM R-CNN-based metal surface defect detection algorithm proposed herein is mainly divided into four parts, whose specific functions are as follows:
(1) Swin Transformer backbone network: the Transformer is built around the self-attention operation unit and differs from the classical convolution operation unit of convolutional neural networks; it has an outstanding ability to capture long-range semantic information in natural language processing, and the trend is for it to replace convolution and unify the two fields of computer vision and natural language processing;
(2) Mix-FPN hybrid feature extraction network: the feature maps output by the backbone network are fused in a mixed manner to obtain five fused feature maps; high-level semantic information is fused with the bottom layers, inter-layer fusion rules are crossed, cross-mixing of feature information is realized to the greatest extent, and the representational capability of the feature maps is enhanced;
(3) Multi-stage R-CNN layer: the single detection threshold adopted in the traditional approach is limited to a single condition and cannot reconcile detection precision with a raised threshold, severely limiting further improvement of detection precision; the multi-stage approach, through multiple thresholds, effectively avoids the rigid coupling between threshold and detection precision, achieves a stronger fit, and further improves detection precision;
(4) Soft-NMS: the traditional non-maximum suppression (NMS) algorithm forces the scores of adjacent detection boxes to zero, so overlapping real objects may be suppressed, causing detection failures and a drop in average precision (AP).
Traditional NMS reset function:

$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases}$$

where s_i is the detection-box score, IoU(M, b_i) is the intersection-over-union of the highest-scoring box M and detection box b_i, and N_t is the set overlap threshold. The soft non-maximum suppression (Soft-NMS) algorithm preserves overlapping detection boxes by reducing their scores instead of forcing them to zero.
Soft-NMS reset function:

$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ s_i\,\bigl(1 - \mathrm{IoU}(M, b_i)\bigr), & \mathrm{IoU}(M, b_i) \ge N_t \end{cases}$$

when a detection box exceeds the overlap threshold, its score is reset with linear attenuation: boxes closer to M are attenuated more strongly, while more distant boxes are affected less.
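As an illustration of the linear-decay rule above, here is a minimal pure-Python sketch of Soft-NMS (the function and variable names are illustrative, not from the patent; boxes are given as [x1, y1, x2, y2]):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def soft_nms_linear(boxes, scores, nt=0.5, score_thresh=0.001):
    """Linear Soft-NMS: decay the scores of boxes overlapping the current
    maximum M by (1 - IoU) instead of forcing them to zero."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])  # highest-scoring box M
        keep.append(m)
        remaining.remove(m)
        survivors = []
        for i in remaining:
            o = iou(boxes[m], boxes[i])
            if o >= nt:                      # overlap threshold N_t exceeded
                scores[i] *= (1.0 - o)       # linear score attenuation
            if scores[i] > score_thresh:     # drop only near-zero scores
                survivors.append(i)
        remaining = survivors
    return keep
```

Unlike hard NMS, a heavily overlapping box survives with a reduced score, so two genuinely overlapping defects can both be reported.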
As shown in fig. 2 and 3, the Swin Transformer backbone network data processing algorithm directly divides an image into batches of fixed size, obtains batch Embedding by Linear transformation, and performs operations such as feature extraction classification after serializing the image into transformers, similar to word Embedding (Linear Embedding) in natural language processing. The Swin transform backbone network firstly cuts the H×W×3pix picture into 4×4 picture blocks (patches), flattens the patches into linear dimensions, converts the linear dimensions into token emudding, and adds the position emudding on the basis of embedding tokens into the token emudding. It is input to a custom number of Transformer Encoder modules. (Each (H/4) × (W/4) ×3pix patch represents a token.)
The backbone has 4 stages in total, each outputting a feature map; the output feature map sizes are (C = 96):

stage1: (H/4) × (W/4) × C
stage2: (H/8) × (W/8) × 2C
stage3: (H/16) × (W/16) × 4C
stage4: (H/32) × (W/32) × 8C
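The stage-size progression can be checked with a small helper (a sketch assuming the standard Swin layout of a 4×4 patch stem followed by 2× patch merging per stage; the function name is illustrative):

```python
def swin_stage_shapes(H, W, C=96, num_stages=4):
    """Feature-map shapes of the backbone stages: the 4x4 patch stem gives
    stage1 at (H/4) x (W/4) x C; each subsequent patch-merging step halves
    the spatial resolution and doubles the channel count."""
    h, w, c = H // 4, W // 4, C
    shapes = []
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes
```

For a 224×224 input this yields 56×56×96, 28×28×192, 14×14×384, and 7×7×768 for stage1-stage4 with C = 96.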
As shown in fig. 4, the backbone outputs 4 feature maps stage1, stage2, stage3, and stage4, which are converted into T1-T4 through 1×1 convolutions that unify the channels to 256.
Starting from T4, 4-fold nearest-neighbor up-sampling is performed, followed by an add operation with T2, obtaining a new fused map T2′. The new T2′ is up-sampled 4-fold by nearest neighbor and added with T1, obtaining a new fused map T1′. The new T1′ is downsampled 8-fold and added with T4, obtaining a new fused map T4′. The new T2′ is downsampled 2-fold and added with T3, obtaining a new fused map T3′. A separate 3×3 convolution is applied to each of the fused maps to obtain the final outputs p1-p4 at 4 scales.
To provide a feature map with a very large receptive field for detecting large-scale features, stage4 is convolved 3×3 with stride = 2 to yield p5. The Mix-FPN module thus takes the 4 feature maps stage1-stage4 as input and outputs 5 feature maps with sizes:

p1: (H/4) × (W/4) × 256
p2: (H/8) × (W/8) × 256
p3: (H/16) × (W/16) × 256
p4: (H/32) × (W/32) × 256
p5: (H/64) × (W/64) × 256
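A shape-level sketch of the Mix-FPN input/output contract (a simplification that tracks only tensor sizes, not the fusion arithmetic; the names are illustrative):

```python
def mix_fpn_output_shapes(H, W, out_ch=256):
    """Output shapes of the Mix-FPN head: p1-p4 keep the spatial strides of
    stage1-stage4 (4, 8, 16, 32) with channels unified to 256 by 1x1 convs;
    p5 comes from one extra stride-2 3x3 conv on stage4 (overall stride 64)."""
    strides = [4, 8, 16, 32, 64]
    return [(H // s, W // s, out_ch) for s in strides]
```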
As shown in figs. 5 and 6, the feature maps from the feature extraction network enter the RPN network, where a 3×3 convolution is applied first, followed by separate 1×1 convolutions that generate the classification and bounding-box predictions. The classification branch performs binary prediction of foreground versus background, while the predicted bounding boxes serve as input to the Multi-stage R-CNN cascade network. The predictions are combined with the anchor generator (Anchors Generator) to produce region proposal boxes and labels for loss calculation.
The loss functions of the RPN and the Multi-stage R-CNN network are as follows:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*) \tag{3}$$

In formula (3), i is the index of a prediction box (anchor); p_i is the predicted probability that the i-th anchor is the true label; p_i^* is 1 for the corresponding positive sample and 0 for a negative sample, which ensures there is no bounding-box regression loss when the anchor is a negative sample; t_i is the predicted bounding-box regression value of the i-th anchor, and t_i^* is the corresponding i-th ground-truth box value, used to compute the offset between the anchor and the ground-truth box. N_cls is the mini-batch size, N_reg is the number of anchor locations, L_cls is the cross-entropy loss, and L_reg is the SmoothL1 loss.
As shown in equation (3), the loss function of the RPN network consists of a classification cross-entropy loss and a bounding-box regression loss.
(1) Classification cross-entropy loss (Cross Entropy Loss): the classifier in the RPN network divides candidate boxes into foreground and background, a binary classification problem whose prediction takes only the two values p and 1 − p:

$$L_{cls}(p_i, p_i^*) = -\bigl[p_i^* \log p_i + (1 - p_i^*)\log(1 - p_i)\bigr]$$

where p_i is the probability that the i-th anchor is predicted as the true label, and p_i^* is 1 for a positive sample and 0 for a negative sample;
(2) The multi-class form of the cross-entropy loss is used in the R-CNN module:

$$L = -\frac{1}{N}\sum_{j}\sum_{c=1}^{M} y_{jc}\,\log(p_{jc})$$

where M is the number of categories; y_jc is an indicator function (0 or 1) that takes 1 if the true class of sample j equals c and 0 otherwise; and p_jc is the predicted probability that sample j belongs to category c.
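The two cross-entropy forms can be written out directly as minimal per-sample sketches (function names are illustrative, not from the patent):

```python
import math

def binary_ce(p, y):
    """Binary cross-entropy for one RPN proposal: y in {0, 1} is the
    foreground label, p the predicted foreground probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def multiclass_ce(probs, label):
    """Multi-class cross-entropy for one R-CNN sample: probs is the predicted
    distribution over M classes, label the true class index; only the term
    with y_jc = 1 survives the inner sum."""
    return -math.log(probs[label])
```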
(3) Bounding-box regression loss:

$$L_{reg}(t_i, t_i^*) = \sum_{i} \mathrm{smooth}_{L1}\bigl(t_i - t_i^*\bigr) \tag{7}$$

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

$$t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a} \tag{6-4}$$

$$t_x^* = \frac{x^* - x_a}{w_a},\quad t_y^* = \frac{y^* - y_a}{h_a},\quad t_w^* = \log\frac{w^*}{w_a},\quad t_h^* = \log\frac{h^*}{h_a} \tag{6-5}$$

In formulas (6-4) and (6-5), x, y, w, h denote the center coordinates and size of the predicted box, x_a, y_a, w_a, h_a those of the anchor box, and x*, y*, w*, h* those of the ground-truth (label) box. The smooth-L1 regression loss of formula (7) computes the loss between the anchor prediction box and the ground-truth box.
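A sketch of the box encoding and SmoothL1 terms above, with boxes in center/size format (x, y, w, h) (helper names are illustrative):

```python
import math

def encode_deltas(box, anchor):
    """Encode a box against an anchor: t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a,
    t_w = log(w/w_a), t_h = log(h/h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def smooth_l1(x):
    """SmoothL1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def bbox_reg_loss(pred_deltas, target_deltas):
    """Bounding-box regression loss: SmoothL1 summed over the 4 components."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_deltas, target_deltas))
```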
As shown in fig. 6, the data set of the present invention adopts a hot rolled steel strip public data set, and has been subjected to label classification, the classification includes: six types of scale (RS), plaque (Pa), crack (Cr), pitted Surface (PS), inclusion (In) and scratch (Sc), 300 sheets each, totaling 1800 sheets.
The traditional R-CNN network struggles no matter how the threshold is set. If the threshold is set high, the predicted bounding box and the real bounding box must share a large context, making it difficult for the network to obtain positive sample data. If the threshold is lower, the network obtains more positive samples, but many of them are not true matches. It is therefore difficult for a single network model to choose one threshold well. To improve the detection capability of the model, a cascade architecture is constructed in which the threshold of each detector module is increased in turn (0.55, 0.65, 0.75). By resampling with the regression output of the previous stage, some extreme samples are removed as the IoU threshold rises, optimizing the deeper detectors and improving overall performance. The intersection-over-union (IoU) is the ratio of the intersection to the union of the predicted bounding box and the real bounding box:

$$\mathrm{IoU} = \frac{\mathrm{area}(B_{pred} \cap B_{gt})}{\mathrm{area}(B_{pred} \cup B_{gt})}$$
The bounding-box prediction obtained via the RPN network's IoU is sent to the first stage of the R-CNN network; if its IoU exceeds the set threshold, it is sent to the second stage, whose IoU threshold is higher; after further screening it is sent to the third stage. In both the RPN network and the Multi-stage R-CNN network, the Soft-NMS soft non-maximum suppression algorithm is used for metal surface defect detection, improving detection precision through a nearest-neighbor attenuation strategy. The Soft-NMS formula is:

$$g_y = \begin{cases} g_y, & \mathrm{IoU}(x, y) < u \\ g_y\,\bigl(1 - \mathrm{IoU}(x, y)\bigr), & \mathrm{IoU}(x, y) \ge u \end{cases}$$

where g_y is the detection-box score, IoU(x, y) is the intersection-over-union of the detection box and the real box, and u is the set non-maximum-suppression threshold.
The backbone feature maps are fed simultaneously to the RPN and the Multi-stage R-CNN network. The cascade receives box_pred_0 from the RPN network, and box_pred and cls_logits are refined by cascading several R-CNN modules, each with a different threshold. The final classification result is the average over the n R-CNN modules, and the predicted bbox is the output of the last R-CNN module.
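The progressive IoU gating of the cascade can be illustrated with a toy sketch (the real stages also re-regress boxes between stages, which is omitted here; the thresholds follow the (0.55, 0.65, 0.75) schedule above, and the function name is illustrative):

```python
def cascade_stage_reached(proposal_ious, thresholds=(0.55, 0.65, 0.75)):
    """For each proposal's IoU with its ground-truth box, return the index of
    the last cascade stage whose threshold it meets (-1 = rejected by stage 1)."""
    reached = []
    for v in proposal_ious:
        stage = -1
        for t in thresholds:
            if v >= t:
                stage += 1   # proposal passes this stage's IoU gate
            else:
                break        # gates are monotonically increasing, so stop
        reached.append(stage)
    return reached
```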
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (7)

1. The metal surface defect detection method based on STM R-CNN is characterized by comprising the following steps of:
s1, acquiring metal surface image data, enhancing the data, classifying tags, and establishing a pairing data set with classified tags;
s2, establishing an STM R-CNN metal surface defect detection network comprising the following 4 network modules: the backbone feature extraction network module, which uses Transformer operation units to extract features of different dimensions from the input data; the Mix-FPN mixed feature extraction network module, which further mixes the feature maps of different dimensions to obtain enhanced features; the RPN network module, which iteratively trains on the enhanced features and outputs regions of interest and defect boundary prediction boxes; and the Multi-stage R-CNN multi-cascade detection network module, which iteratively trains on the regions of interest output by the RPN module in combination with soft non-maximum suppression, and outputs defect boundary prediction boxes and predicted classification labels stage by stage;
s3, acquiring metal surface image data in real time, inputting it into the established STM R-CNN metal surface defect detection network, automatically locating the defect boundary prediction box, and outputting the predicted classification label.
2. The STM R-CNN based metal surface defect detection method of claim 1, wherein the label classification is a manual classification of defects.
3. The STM R-CNN-based metal surface defect detection method of claim 1, wherein the backbone feature extraction network module extracts features of different dimensions from the input data by:
1) Dividing the H×W×3 original image into 4×4 image patches, flattening the patches into linear embeddings, and adding the pixel position of each patch within the image;
2) Feeding these embeddings through sequentially connected Transformer blocks to obtain 4 feature maps, stage1-stage4, of different dimensions.
4. The STM R-CNN-based metal surface defect detection method according to claim 1, wherein the Mix-FPN mixed feature extraction network module applies a cross-layer cross data fusion method to the 4 feature map inputs stage1-stage4 and outputs 5 enhanced feature maps p1-p5.
5. The STM R-CNN-based metal surface defect detection method of claim 4, wherein the cross-layer cross data fusion method comprises:
a. adopting T4+T2, T3+T1, T4+T2+T3, and T3+T1+T4 feature fusion, followed by convolution operations, to output the feature information p1-p4;
b. performing a 3×3, stride = 2 convolution on stage4 to obtain the feature information p5;
wherein T1-T4 are obtained from the 4 feature maps stage1-stage4 output by the backbone feature extraction network module via 1×1 convolutions that transform the channels.
6. The STM R-CNN-based metal surface defect detection method according to claim 1, wherein the T4+T2, T3+T1, T4+T2+T3 and T3+T1+T4 feature fusions comprise:
1) performing 4-fold nearest-neighbor upsampling on T4, then an add operation with T2, to obtain a new fused map T2′;
2) performing 4-fold nearest-neighbor upsampling on T3, then an add operation with T1, to obtain a new fused map T1′;
3) performing 8-fold random downsampling on the new T1′, then an add operation with T4, to obtain a new fused map T4′;
4) performing 2-fold random downsampling on the new T2′, then an add operation with T3, to obtain a new fused map T3′;
5) applying a 3×3 convolution to each of the four new fused maps to obtain the final outputs p1 to p4 at 4 scales.
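The fusion steps above can be sketched in NumPy. The map sizes (strides 4/8/16/32 of an assumed 128×128 input), the 256-channel width, and the reading of "random downsampling" as keeping one randomly chosen row/column per cell are all illustrative assumptions; the fusion combinations follow the T4+T2, T3+T1, T3+T1+T4 and T4+T2+T3 list named in claim 5:

```python
import numpy as np

def up_nn(x, f):
    """f-fold nearest-neighbor upsampling of an (H, W, C) feature map."""
    return x.repeat(f, axis=0).repeat(f, axis=1)

def down_rand(x, f, rng):
    """f-fold 'random' downsampling: keep one randomly picked row/column per f-cell."""
    H, W, _ = x.shape
    rows = np.arange(0, H, f) + rng.integers(0, f, H // f)
    cols = np.arange(0, W, f) + rng.integers(0, f, W // f)
    return x[rows][:, cols]

rng = np.random.default_rng(0)
C = 256  # assumed common channel width after the 1x1 alignment
# T1..T4 at strides 4/8/16/32 of an assumed 128x128 input
T1, T2, T3, T4 = (rng.standard_normal((s, s, C)) for s in (32, 16, 8, 4))

F2 = up_nn(T4, 4) + T2           # step 1: T4 + T2
F1 = up_nn(T3, 4) + T1           # step 2: T3 + T1
F4 = down_rand(F1, 8, rng) + T4  # step 3: T3 + T1 + T4
F3 = down_rand(F2, 2, rng) + T3  # step 4: T4 + T2 + T3
# step 5 would apply a 3x3 convolution to each of F1..F4 to yield p1..p4
print([f.shape for f in (F1, F2, F3, F4)])
```

Note how the scale factors make the spatial sizes line up: 4× up bridges two stride-2 stages, 8× down bridges three, and 2× down bridges one, so every add operation acts on maps of identical shape.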
7. The STM R-CNN-based metal surface defect detection method of claim 1, wherein the loss functions of the RPN network module and the Multi-stage R-CNN multi-cascade detection network module are a classification cross-entropy loss and a bounding-box regression loss.
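The two loss terms of claim 7 can be sketched as follows. The claim does not specify the regression loss form; smooth L1 (the usual Fast/Faster R-CNN choice) is used here as an assumption, and the logits and boxes are made-up example values:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Classification cross-entropy over class logits; labels are class indices."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 bounding-box regression loss (assumed form, as in Faster R-CNN)."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

logits = np.array([[2.0, 0.5], [0.1, 1.5]])                 # example class scores
labels = np.array([0, 1])                                    # example ground truth
boxes_p = np.array([[0.1, 0.2, 0.9, 0.8]])                   # example predicted box
boxes_t = np.array([[0.0, 0.25, 1.0, 0.8]])                  # example target box
total_loss = cross_entropy(logits, labels) + smooth_l1(boxes_p, boxes_t)
```

In a cascade detector each R-CNN stage would contribute its own pair of these terms, typically at an increasing IoU threshold per stage.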
CN202111430299.0A 2021-11-29 2021-11-29 Metal surface defect detection method based on STM R-CNN Pending CN116205832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111430299.0A CN116205832A (en) 2021-11-29 2021-11-29 Metal surface defect detection method based on STM R-CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111430299.0A CN116205832A (en) 2021-11-29 2021-11-29 Metal surface defect detection method based on STM R-CNN

Publications (1)

Publication Number Publication Date
CN116205832A true CN116205832A (en) 2023-06-02

Family

ID=86515997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111430299.0A Pending CN116205832A (en) 2021-11-29 2021-11-29 Metal surface defect detection method based on STM R-CNN

Country Status (1)

Country Link
CN (1) CN116205832A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541554A (en) * 2023-11-15 2024-02-09 江西理工大学 Surface defect detection method based on deep learning
CN117994257A (en) * 2024-04-07 2024-05-07 中国机械总院集团江苏分院有限公司 Fabric flaw analysis and detection system and analysis and detection method based on deep learning


Similar Documents

Publication Publication Date Title
Wan et al. Ceramic tile surface defect detection based on deep learning
CN109859163A (en) 2019-06-07 LCD defect detection method based on a feature pyramid convolutional neural network
CN116205832A (en) Metal surface defect detection method based on STM R-CNN
CN110135486B (en) Chopstick image classification method based on adaptive convolutional neural network
CN113947590A (en) Surface defect detection method based on multi-scale attention guidance and knowledge distillation
CN113160139B (en) Attention-based steel plate surface defect detection method of Faster R-CNN network
CN113628178B (en) Steel product surface defect detection method with balanced speed and precision
CN114066820A (en) 2022-02-18 Fabric defect detection method based on Swin-Transformer and NAS-FPN
CN110598698B (en) Natural scene text detection method and system based on adaptive regional suggestion network
CN115953408B (en) YOLOv 7-based lightning arrester surface defect detection method
CN116343053B (en) Automatic solid waste extraction method based on fusion of optical remote sensing image and SAR remote sensing image
CN105550712A (en) 2016-05-04 Aurora image classification method based on an optimized convolutional auto-encoding network
CN116883393B (en) Metal surface defect detection method based on anchor frame-free target detection algorithm
CN115861281A (en) Anchor-frame-free surface defect detection method based on multi-scale features
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN116012310A (en) Cross-sea bridge pier surface crack detection method based on linear residual error attention
CN116523875A (en) Insulator defect detection method based on FPGA pretreatment and improved YOLOv5
CN114037684B (en) Defect detection method based on yolov and attention mechanism model
CN115631186A (en) Industrial element surface defect detection method based on double-branch neural network
Fan et al. Application of YOLOv5 neural network based on improved attention mechanism in recognition of Thangka image defects
CN115147347A (en) Method for detecting surface defects of malleable cast iron pipe fitting facing edge calculation
CN112837281B (en) Pin defect identification method, device and equipment based on cascade convolution neural network
CN114092467A (en) Scratch detection method and system based on lightweight convolutional neural network
CN110136098B (en) Cable sequence detection method based on deep learning
CN116342496A (en) Abnormal object detection method and system for intelligent inspection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination