CN117854072B - Automatic labeling method for industrial visual defects - Google Patents

Automatic labeling method for industrial visual defects

Info

Publication number
CN117854072B
CN117854072B (Application No. CN202410258471.6A)
Authority
CN
China
Prior art keywords
model
defect
loss
image
automatic labeling
Prior art date
Legal status
Active
Application number
CN202410258471.6A
Other languages
Chinese (zh)
Other versions
CN117854072A (en)
Inventor
李嘉欣
钱凯
甘如玉
沈云帆
Current Assignee
Deji Intelligent Technology Suzhou Co ltd
Original Assignee
Deji Intelligent Technology Suzhou Co ltd
Priority date
Filing date
Publication date
Application filed by Deji Intelligent Technology Suzhou Co ltd filed Critical Deji Intelligent Technology Suzhou Co ltd
Priority to CN202410258471.6A priority Critical patent/CN117854072B/en
Publication of CN117854072A publication Critical patent/CN117854072A/en
Application granted granted Critical
Publication of CN117854072B publication Critical patent/CN117854072B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an automatic labeling method for industrial visual defects, which comprises the following steps: collecting defect pictures and defect-free pictures from an industrial production site; establishing a semi-automatic labeling model based on SAM and customizing defect types; training a YOLOv5 model on the defect pictures; training a ViT classification model on crops obtained by expanding the labeled defect contours to twice their size; training a semi-supervised MemSeg model on the defect-free pictures; taking the model with the optimal result in each case as the corresponding model; and locating defects in unlabeled images with the YOLOv5 supervised model and the MemSeg semi-supervised model. The invention provides a fully automatic labeling method: a supervised YOLOv5 model and a semi-supervised MemSeg model are used for training and inference, with 500 defect images used to train the supervised model and 200 defect-free images used to train the semi-supervised model; the SAM algorithm is integrated into both the semi-automatic and fully automatic labeling processes, which accelerates semi-automatic labeling and improves its accuracy; and a ViT classification model is trained to improve the accuracy of defect contour classification.

Description

Automatic labeling method for industrial visual defects
Technical Field
The invention relates to the field of industrial vision, in particular to an automatic labeling method for industrial vision defects.
Background
Industrial visual defects are defects such as cracks, bubbles and bruises that appear on products during production. Such defects are often detected by manual inspection, which is accurate but inefficient; to solve this problem, defect detection methods that combine traditional algorithms with deep learning algorithms have been proposed.
Deep-learning-based defect detection requires a large number of labeled defect images, because the model must learn the features of defects during training in order to locate and classify different types of industrial defects. Manually labeled data varies from annotator to annotator, and industrial defect images are difficult to label: the labeling cycle is long and consistent quality is hard to guarantee.
Existing automatic labeling methods typically solve only part of these problems. For example, Chinese patent document CN117095394A discloses a SAM-based semi-automatic labeling method for defects of power transformation equipment.
As another example, CN116721419A discloses an auxiliary labeling method combined with the large visual model SAM: the picture to be labeled is first segmented by SAM into a number of candidate image-embedding masks, the masks that meet the user's requirements are then selected by hovering with the mouse wheel, a labeling frame is generated, and the steps are repeated until all pictures are labeled. The method produces many candidate masks and requires repeated interactive mouse clicks rather than generating the result in a single pass.
As another example, CN115641323A discloses an automatic labeling method and device for medical images: the medical images are first labeled, a corresponding segmentation model is then adaptively trained on the preprocessed image data, and finally unlabeled defects are automatically labeled with the trained optimal model, iterating this step.
At present, automatic labeling of industrial defects still cannot be completely separated from manual intervention; building on the prior art, it is necessary to reduce manual intervention while improving labeling accuracy and efficiency.
Disclosure of Invention
The invention aims to provide an automatic labeling method for industrial visual defects, which aims to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions: an automatic labeling method for industrial visual defects comprises the following steps:
step one: collecting defect pictures and defect-free pictures from an industrial production site, and cutting and cleaning the data;
Step two: establishing a semi-automatic labeling model based on the SAM, adjusting the outline shape and the size of the defect through an interactive function, generating the required defect, and customizing the defect type to finish the semi-automatic labeling;
step three: training the labeled industrial defect images with a YOLOv5 model, and taking the model with the optimal result as the supervised positioning model;
step four: training a defect-free picture by adopting a semi-supervised MemSeg model, and taking a model with an optimal result as a semi-supervised positioning model;
Step five: amplifying the marked defect image by 1 time, cutting off the outline, classifying according to categories, training by adopting ViT classification models, and taking the model with the optimal result as the classification model;
Step six: transmitting the defect image data which is not subjected to semi-automatic labeling to a full-automatic labeling tool, positioning defects by the defect image through a trained YOLOv supervision model and a MemSeg semi-supervision model, removing the defects, calculating the coordinates of a defect center point, transmitting the coordinates to a SAM model, adjusting the outline and the size by the SAM model, amplifying the outline by 1 time, classifying by a ViT classification model, and generating labeled defects.
Preferably, the SAM model in step two includes a network structure, a loss function, data enhancement and a pre-trained model. The network structure adopts an encoder-decoder architecture similar to a U-Net network: the encoder, composed of convolution layers and pooling layers, extracts the defect image features, and the decoder, composed of several deconvolution and up-convolution layers, restores the transformed feature map to the target size and outputs the final segmentation result.
Preferably, the loss function of the SAM algorithm is a cross-entropy-based multi-task loss comprising a pixel-level classification loss, used to decide whether each pixel belongs to a defect, and a regression loss, used to measure the difference between the real defect contour and the predicted defect contour.
Preferably, the SAM algorithm adopts data enhancement techniques of random rotation, scaling, cropping, flipping, color space transformation and noise addition to improve the generalization ability of the model. The SAM model adopts a pre-trained image classification model as the initial weights of the encoder to extract high-quality defect features. The backbone network of SAM is based on the ViT algorithm and contains a multi-head attention mechanism, whose calculation formula can be expressed as:
SA(Q, K, V) = softmax(QKᵀ/√d_k)·V
where Q is the query matrix, K the key matrix, V the value matrix, d_k the dimension of the keys and queries, and Kᵀ the transpose of the key matrix K. During the matrix calculation, Q, K and V are each linearly projected, giving Q_i = QW_i^Q, K_i = KW_i^K and V_i = VW_i^V, with d_k equal to the model dimension divided by the number of heads H. The attention head operation is then performed H times and the results are concatenated; the head concatenation formulas can be expressed as:
Head_i = SA(QW_i^Q, KW_i^K, VW_i^V)
O = concat(Head_1, …, Head_H)
That is, Q, K and V are linearly projected, attention is computed per head, and the head outputs are finally concatenated.
Preferably, the YOLOv5 network in step three is a single-stage detection algorithm consisting mainly of an input stage, a backbone network, a neck network and a head network.
The input stage of the YOLOv5 model adopts Mosaic data enhancement, adaptive anchor frame calculation and adaptive picture scaling.
The backbone network mainly comprises a Focus structure, whose main purpose is to slice the picture; the backbone network also comprises a CSP structure, which splits the feature map into two parts, performs feature extraction on one part with convolution layers, and fuses the other part with that result by concatenating channels.
Preferably, the neck network of YOLOv5 in step three adopts an FPN plus PAN structure: the FPN layers convey strong semantic information top-down, and the PAN conveys localization features bottom-up.
The head network applies 1×1 convolutions to the feature maps of different scales obtained from the neck network. It comprises 3 detection layers corresponding to the 3 feature maps of different scales from the neck, and for each the network sets 3 anchors with different aspect ratios to predict and regress targets. The regression formulas of a target frame can be expressed as:
b_x = 2σ(t_x) − 0.5 + c_x
b_y = 2σ(t_y) − 0.5 + c_y
b_w = p_w·(2σ(t_w))²
b_h = p_h·(2σ(t_h))²
where σ(·) is the sigmoid function, (b_x, b_y) are the x-axis and y-axis center coordinates of the generated prediction frame, (b_w, b_h) are the width and height of the prediction frame, (c_x, c_y) are the upper-left corner coordinates of the grid cell containing the prediction frame's center point, (t_x, t_y) are the offsets of the center point relative to that corner, (t_w, t_h) are the scalings of the prediction frame's width and height relative to the anchor frame, and (p_w, p_h) are the width and height of the anchor frame. YOLOv5 adopts an aspect-ratio matching strategy, whose main flow is:
(1) The width ratios (w1/w2, w2/w1) and height ratios (h1/h2, h2/h1) between the manually labeled box and the 9 different anchor frames are calculated;
(2) The maximum value among (w1/w2, w2/w1) and (h1/h2, h2/h1) is taken as the ratio between the manually labeled box and the anchor frame;
(3) If the ratio obtained in step (2) is smaller than the set threshold, that anchor frame is responsible for predicting the real labeled frame, the prediction frame regressed from it is a positive sample, and the remaining prediction frames are negative samples.
Preferably, the YOLOv5 model in step three involves several loss functions, whose role is to measure the difference between the predicted results and the true labeled results. The total loss of YOLOv5 can be expressed as:
Loss = box_gain×bbox_loss + cls_gain×cls_loss + obj_gain×obj_loss
where box_gain, cls_gain and obj_gain are the weights of the different losses, with defaults of 0.05, 0.5 and 1.0 respectively; bbox_loss denotes the rectangular box loss, cls_loss the classification loss and obj_loss the confidence loss. The rectangular box loss is measured on the basis of IoU, which represents the overlap between the prediction frame and the manually labeled real frame in target detection. Denoting the prediction frame by A and the real frame by B, IoU can be expressed as:
IoU = |A ∩ B| / |A ∪ B|
When the prediction frame and the real frame do not intersect, IoU cannot reflect the relationship between them, which affects gradient back-propagation and prevents training; moreover, IoU cannot accurately reflect how well the two frames coincide. YOLOv5 therefore uses CIoU by default to calculate the bounding-box loss. CIoU builds on DIoU and further considers the aspect ratio of the anchor frame; the calculation formula of DIoU is:
L_DIoU = 1 − IoU + ρ²(b, b^gt)/c²
where b and b^gt denote the center points of the prediction frame and the real frame respectively, ρ denotes the Euclidean distance between the two center points, and c denotes the diagonal length of the smallest enclosing region of the prediction and real frames. CIoU adds an influence factor αv on the basis of DIoU, where α is a weight parameter that can be expressed as:
α = v / ((1 − IoU) + v)
and v measures the consistency of the aspect ratios, which can be expressed as:
v = (4/π²)·(arctan(w^gt/h^gt) − arctan(w/h))²
CIoU can then be expressed as:
L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv
YOLOv5 by default uses a binary cross-entropy function to calculate the classification loss, which can be expressed as:
L = −y·log(p) − (1 − y)·log(1 − p)
where y is the label corresponding to the input sample and p is the probability, predicted by the model, that the input sample is a positive sample.
Preferably, the MemSeg algorithm in step four is a semi-supervised image surface-defect detection network that detects defects on the surface of industrial products mainly by exploiting differences and commonalities between images. The MemSeg model is mainly based on a U-Net network, uses the pre-trained ResNet18 as the encoder, and introduces an anomaly-simulation strategy, a memory module, a multi-scale feature fusion module and an attention mechanism module, which improve the accuracy of the anomaly localization model.
MemSeg contains an anomaly simulation strategy, which roughly includes three steps:
(1) Generating a mask image using Perlin noise and the target foreground;
(2) Extracting the ROI defined by the mask image M from the noise image I_n to generate the noise foreground image I_n', which can be expressed as:
I_n' = δ(M ⊙ I_n) + (1 − δ)(M ⊙ I)
where I is the original image and ⊙ denotes element-wise multiplication. For the noise image, a relatively high maximum transparency is desirable in order to increase the difficulty of model learning and thereby the robustness of the model, so δ in the formula is sampled randomly and uniformly from [0.15, 1];
(3) The noise foreground image is superimposed on the original image to obtain the simulated anomalous image I_A;
I_A can be expressed as:
I_A = (1 − M) ⊙ I + I_n'
MemSeg further comprises a memory module and spatial attention maps. The memory module records the features of defect-free images extracted by the pre-trained encoder as memory information MI. The features of the input image are combined to form the image information II, the L2 distance between II and every piece of memory information MI is then calculated, and N pieces of difference information DI between the input image and the memory samples are obtained:
DI_i = ‖II − MI_i‖₂
Among the N pieces of difference information, the best difference information DI* between II and MI is selected by taking the minimum sum of all elements of each DI as the criterion, which can be expressed as:
DI* = arg min_{DI_i} Σ DI_i
Finally, multi-scale feature fusion is adopted, and the fused features flow to the decoder through the skip connections of U-Net.
MemSeg also contains spatial attention maps: three spatial attention maps are derived from DI* to strengthen the estimate of the anomalous regions. For the three features of different scales in DI*, the average over the channels is computed, yielding three feature maps of sizes 16×16, 32×32 and 64×64 respectively. The 16×16 feature map is used directly as the spatial attention map M_3, which can be expressed as:
M_3 = (1/C_3)·Σ_{i=1}^{C_3} DI*_{3,i}
After M_3 is up-sampled, an element-wise multiplication with the 32×32 feature map is performed to obtain M_2, which can be expressed as:
M_2 = ((1/C_2)·Σ_{i=1}^{C_2} DI*_{2,i}) ⊙ M_3^U
After M_2 is up-sampled, an element-wise multiplication with the 64×64 feature map is performed to obtain M_1. The spatial attention maps M_1, M_2 and M_3 are used to weight the information obtained at C_1, C_2 and C_3 respectively.
M_1 can be expressed as:
M_1 = ((1/C_1)·Σ_{i=1}^{C_1} DI*_{1,i}) ⊙ M_2^U
where C_1, C_2 and C_3 denote the numbers of channels, DI*_{1,i}, DI*_{2,i} and DI*_{3,i} denote the feature map of channel i at the corresponding scale, and M_2^U and M_3^U denote the feature maps M_2 and M_3 after up-sampling.
The MemSeg loss function uses an L1 loss and a focal loss to guarantee the similarity of all pixels in image space. The L1 loss and the focal loss can be expressed as:
L_l1 = ‖S − Ŝ‖₁
L_f = −(1 − p_t)^γ·log(p_t)
where S and Ŝ denote the ground-truth and predicted segmentation masks, p_t the predicted probability of the true class of a pixel, and γ the focusing parameter. Finally, the two constraints are combined into the objective function.
Preferably, the ViT model in step five is mainly used to classify and train on the cropped small defect images. The ViT model is a Transformer-based image classifier and can be divided into the following parts:
(1) Image patching: fixed-size pictures are input and each picture is split into a number of fixed-size patches; the dimension remains unchanged after the input passes through the linear projection layer, and patch embedding converts the vision problem into a seq2seq problem;
(2) Position encoding: the position codes can be viewed as a table with N rows, where N equals the length of the input sequence; each row is a vector whose dimension equals the dimension of the input embedding, and the dimension remains unchanged after the position codes are added;
(3) Transformer encoder: the encoder is stacked from Transformer Blocks, each containing sub-modules such as a multi-head attention mechanism and a fully connected layer;
(4) MLP classification head: the output of the Transformer encoder is mapped into the class space through a fully connected layer to obtain the final classification result.
Preferably, the specific implementation of step six is:
First, the unlabeled defect pictures are input into the fully automatic labeling tool, and the YOLOv5 supervised model and the MemSeg semi-supervised model are used to infer and locate defects, giving the defects found by the two models;
Then the defect center points obtained by the supervised and semi-supervised models are de-duplicated: if the distance between any two defect center points is smaller than 10 pixels, the center point of the defect with the higher confidence is kept and the center point of the defect with the lower confidence is deleted;
Then the center point coordinates of the defects are used as input to the SAM model, which returns the defect contours; the defect contours are finally expanded to twice their size and classified by the ViT model to generate the final result. This process requires no human intervention and can automatically label a large number of defects.
The invention has the technical effects and advantages that:
the invention provides a fully automatic labeling method that greatly improves labeling efficiency while guaranteeing labeling accuracy. A supervised YOLOv5 model and a semi-supervised MemSeg model are used for training and inference; combining supervision and semi-supervision makes defect localization more accurate and reduces missed defects compared with manual labeling. Initially, 500 defect images are used to train the supervised model and 200 defect-free images to train the semi-supervised model. The SAM algorithm is integrated into both the semi-automatic and fully automatic labeling processes, which accelerates semi-automatic labeling and improves the accuracy of fully automatic labeling, and a ViT classifier is adopted to improve the accuracy of defect contour classification.
Drawings
FIG. 1 is a schematic diagram of the overall process of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides an automatic labeling method for industrial visual defects, which is shown in figure 1, and comprises the following steps:
step one: collecting defect pictures and defect-free pictures from an industrial production site, and cutting and cleaning the data;
Step two: establishing a semi-automatic labeling model based on the SAM, adjusting the outline shape and the size of the defect through an interactive function, generating the required defect, and customizing the defect type to finish the semi-automatic labeling;
step three: training the labeled industrial defect images with a YOLOv5 model, and taking the model with the optimal result as the supervised positioning model;
step four: training a defect-free picture by adopting a semi-supervised MemSeg model, and taking a model with an optimal result as a semi-supervised positioning model;
Step five: amplifying the marked defect image by 1 time, cutting off the outline, classifying according to categories, training by adopting ViT classification models, and taking the model with the optimal result as the classification model;
Step six: transmitting the defect image data which is not subjected to semi-automatic labeling to a full-automatic labeling tool, positioning defects by the defect image through a trained YOLOv supervision model and a MemSeg semi-supervision model, removing the defects, calculating the coordinates of a defect center point, transmitting the coordinates to a SAM model, adjusting the outline and the size by the SAM model, amplifying the outline by 1 time, classifying by a ViT classification model, and generating labeled defects.
The SAM model in step two comprises a network structure, a loss function, data enhancement and a pre-trained model. The network structure adopts an encoder-decoder architecture similar to a U-Net network: the encoder, composed of convolution layers and pooling layers, extracts the features of the defect image, and the decoder, composed of several deconvolution and up-convolution layers, restores the transformed feature map to the target size and outputs the final segmentation result.
The loss function of the SAM algorithm is a cross-entropy-based multi-task loss comprising a pixel-level classification loss and a regression loss: the classification loss decides whether each pixel belongs to a defect, i.e. whether a pixel is foreground or background, and the regression loss measures the difference between the real defect contour and the predicted defect contour.
The SAM algorithm adopts data enhancement techniques of random rotation, scaling, cropping, flipping, color space transformation and noise addition to improve the generalization ability of the model, and adopts a pre-trained image classification model as the initial weights of the encoder to extract high-quality defect features. The backbone network of SAM is based on the ViT algorithm and contains a multi-head attention mechanism, which computes the similarity of each element of the input sequence with the other elements, derives a weight for each element from those similarities, and finally outputs a weighted sum of the elements. The calculation formula of the multi-head attention mechanism can be expressed as:
SA(Q, K, V) = softmax(QKᵀ/√d_k)·V
where Q is the query matrix, K the key matrix, V the value matrix, d_k the dimension of the keys and queries, and Kᵀ the transpose of the key matrix K. During the matrix calculation, Q, K and V are each linearly projected, giving Q_i = QW_i^Q, K_i = KW_i^K and V_i = VW_i^V, with d_k equal to the model dimension divided by the number of heads H. The attention head operation is then performed H times and the results are concatenated; the head concatenation formulas can be expressed as:
Head_i = SA(QW_i^Q, KW_i^K, VW_i^V)
O = concat(Head_1, …, Head_H)
That is, Q, K and V are linearly projected, attention is computed per head, and the head outputs are finally concatenated.
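As a concrete illustration of the scaled dot-product attention and head concatenation described above, the following minimal NumPy sketch reproduces the formulas; the sequence length, model dimension and random projection weights are illustrative assumptions rather than values prescribed by the invention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # SA(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, num_heads, rng):
    # X: (seq_len, d_model); assumed per-head dimension d_k = d_model / H
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # illustrative random projection weights W_i^Q, W_i^K, W_i^V
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    return np.concatenate(heads, axis=-1)   # O = concat(Head_1, ..., Head_H)

rng = np.random.default_rng(0)
out = multi_head_attention(rng.standard_normal((16, 64)), num_heads=8, rng=rng)
print(out.shape)  # (16, 64)
```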
In addition, the SAM Transformer contains another layer, the feed-forward network, which is mainly used to apply a non-linear transformation to the output sequence. In the automatic labeling task, SAM can accurately segment the contour of a defect, interact with the annotator in real time, and generalizes well across different tasks.
The YOLOv5 network in step three is a single-stage detection algorithm consisting mainly of an input stage, a backbone network, a neck network and a head network.
The input stage of the YOLOv5 model adopts Mosaic data enhancement, adaptive anchor frame calculation and adaptive picture scaling; these operations enrich the features and improve the accuracy and robustness of detection.
The backbone network mainly comprises a Focus structure, whose main purpose is to slice the picture, similar to adjacent down-sampling, and to finally output a high-quality feature map without information loss. The backbone network also comprises a CSP structure, which splits the feature map into two parts, performs feature extraction on one part with convolution layers, and fuses the other part with that result by concatenating channels; CSP reduces the amount of computation, strengthens the learning ability of the model, and improves its detection accuracy.
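The Focus slicing operation described above can be viewed as a space-to-depth rearrangement; the sketch below, with an assumed input shape, shows how a picture is sliced into four interleaved sub-images and stacked along the channel dimension without discarding any pixels.

```python
import numpy as np

def focus_slice(x):
    # x: (C, H, W) -> (4C, H/2, W/2); interleaved slicing similar to the Focus structure
    return np.concatenate([x[:, 0::2, 0::2],   # top-left pixels
                           x[:, 1::2, 0::2],   # bottom-left pixels
                           x[:, 0::2, 1::2],   # top-right pixels
                           x[:, 1::2, 1::2]],  # bottom-right pixels
                          axis=0)

img = np.random.rand(3, 640, 640)
print(focus_slice(img).shape)  # (12, 320, 320) -- no information is lost
```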
The neck network of YOLOv5 in step three adopts an FPN plus PAN structure: the FPN layers convey strong semantic information top-down, and the PAN conveys localization features bottom-up.
The head network applies 1×1 convolutions to the feature maps of different scales obtained from the neck network. It comprises 3 detection layers corresponding to the 3 feature maps of different scales from the neck, and for each the network sets 3 anchors with different aspect ratios to predict and regress targets. The regression formulas of a target frame can be expressed as:
b_x = 2σ(t_x) − 0.5 + c_x
b_y = 2σ(t_y) − 0.5 + c_y
b_w = p_w·(2σ(t_w))²
b_h = p_h·(2σ(t_h))²
where σ(·) is the sigmoid function, (b_x, b_y) are the x-axis and y-axis center coordinates of the generated prediction frame, (b_w, b_h) are the width and height of the prediction frame, (c_x, c_y) are the upper-left corner coordinates of the grid cell containing the prediction frame's center point, (t_x, t_y) are the offsets of the center point relative to that corner, (t_w, t_h) are the scalings of the prediction frame's width and height relative to the anchor frame, and (p_w, p_h) are the width and height of the anchor frame. In summary, each grid cell on each detection layer of YOLOv5 contains several anchor frames, and not every anchor frame contains the required target: only some anchor frames need to regress to the current target, so anchor frames are divided into positive and negative samples. YOLOv5 adopts an aspect-ratio matching strategy (illustrated in the sketch after the following list), whose main flow is:
(1) The width ratios (w1/w2, w2/w1) and height ratios (h1/h2, h2/h1) between the manually labeled box and the 9 different anchor frames are calculated;
(2) The maximum value among (w1/w2, w2/w1) and (h1/h2, h2/h1) is taken as the ratio between the manually labeled box and the anchor frame;
(3) If the ratio obtained in step (2) is smaller than the set threshold, that anchor frame is responsible for predicting the real labeled frame, the prediction frame regressed from it is a positive sample, and the remaining prediction frames are negative samples.
Through this matching method, the number of positive samples is increased and the convergence of the model is accelerated.
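A compact sketch of the target-frame regression and the aspect-ratio matching strategy above; the anchor sizes and the threshold value (YOLOv5 commonly uses about 4.0) are illustrative assumptions, not values fixed by the invention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # b_x = 2*sigmoid(t_x) - 0.5 + c_x, b_w = p_w * (2*sigmoid(t_w))**2, etc.
    bx = 2 * sigmoid(tx) - 0.5 + cx
    by = 2 * sigmoid(ty) - 0.5 + cy
    bw = pw * (2 * sigmoid(tw)) ** 2
    bh = ph * (2 * sigmoid(th)) ** 2
    return bx, by, bw, bh

def match_anchors(gt_w, gt_h, anchors, thr=4.0):
    # aspect-ratio matching: max(w1/w2, w2/w1, h1/h2, h2/h1) < thr -> positive anchor
    positives = []
    for i, (aw, ah) in enumerate(anchors):
        ratio = max(gt_w / aw, aw / gt_w, gt_h / ah, ah / gt_h)
        if ratio < thr:
            positives.append(i)
    return positives

anchors = [(10, 13), (16, 30), (33, 23)]       # illustrative anchor sizes
print(decode_box(0.2, -0.1, 0.3, 0.1, 4, 7, 16, 30))
print(match_anchors(gt_w=20, gt_h=28, anchors=anchors))  # indices of positive anchors
```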
In step three, the YOLOv5 model contains several loss functions, whose role is to measure the difference between the predicted results and the true labeled results; the closer the prediction is to the truth, the smaller the loss value. The total loss of YOLOv5 can be expressed as:
Loss = box_gain×bbox_loss + cls_gain×cls_loss + obj_gain×obj_loss
where box_gain, cls_gain and obj_gain are the weights of the different losses, with defaults of 0.05, 0.5 and 1.0 respectively; bbox_loss denotes the rectangular box loss, cls_loss the classification loss and obj_loss the confidence loss. The rectangular box loss is measured on the basis of IoU, which represents the overlap between the prediction frame and the manually labeled real frame in target detection. Denoting the prediction frame by A and the real frame by B, IoU can be expressed as:
IoU = |A ∩ B| / |A ∪ B|
When the prediction frame and the real frame do not intersect, IoU cannot reflect the relationship between them, which affects gradient back-propagation and prevents training; moreover, IoU cannot accurately reflect how well the two frames coincide. YOLOv5 therefore uses CIoU by default to calculate the bounding-box loss. CIoU builds on DIoU and further considers the aspect ratio of the anchor frame; the calculation formula of DIoU is:
L_DIoU = 1 − IoU + ρ²(b, b^gt)/c²
where b and b^gt denote the center points of the prediction frame and the real frame respectively, ρ denotes the Euclidean distance between the two center points, and c denotes the diagonal length of the smallest enclosing region of the prediction and real frames. CIoU adds an influence factor αv on the basis of DIoU, where α is a weight parameter that can be expressed as:
α = v / ((1 − IoU) + v)
and v measures the consistency of the aspect ratios, which can be expressed as:
v = (4/π²)·(arctan(w^gt/h^gt) − arctan(w/h))²
CIoU can then be expressed as:
L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv
YOLOv5 by default uses a binary cross-entropy function to calculate the classification loss, which can be expressed as:
L = −y·log(p) − (1 − y)·log(1 − p)
where y is the label corresponding to the input sample (1 for a positive sample, 0 for a negative sample) and p is the probability, predicted by the model, that the input sample is a positive sample.
The confidence loss represents the reliability of each prediction frame; the larger its value, the higher the probability that the prediction frame is accurate. YOLOv5 by default computes the confidence loss with a binary cross-entropy function, in the same way as the classification loss.
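The rectangular-box losses above can be illustrated with the following sketch, which computes IoU, the DIoU distance term and the full CIoU loss for axis-aligned boxes given as (x1, y1, x2, y2); it is a simplified illustration rather than the exact YOLOv5 implementation.

```python
import math

def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ciou_loss(pred, gt):
    i = iou(pred, gt)
    # squared center-point distance (rho^2) and squared diagonal of the minimum enclosing box (c^2)
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # aspect-ratio consistency term v and weight alpha
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - i) + v + 1e-9)
    return 1 - i + rho2 / c2 + alpha * v

print(ciou_loss((10, 10, 50, 60), (12, 8, 48, 64)))
```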
The MemSeg algorithm in step four is a semi-supervised image surface-defect detection network that detects defects on the surface of industrial products mainly by exploiting differences and commonalities between images. The model is mainly based on a U-Net network, uses the pre-trained ResNet18 as the encoder, and introduces an anomaly-simulation strategy, a memory module, a multi-scale feature fusion (MSFF) module and an attention mechanism module, which improve the accuracy of the anomaly localization model.
MemSeg contains an anomaly simulation strategy, which roughly includes three steps:
(1) Generating a mask image using Perlin noise and the target foreground;
(2) Extracting the ROI defined by the mask image M from the noise image I_n to generate the noise foreground image I_n', which can be expressed as:
I_n' = δ(M ⊙ I_n) + (1 − δ)(M ⊙ I)
where I is the original image and ⊙ denotes element-wise multiplication. For the noise image, a relatively high maximum transparency is desirable in order to increase the difficulty of model learning and thereby the robustness of the model, so δ in the formula is sampled randomly and uniformly from [0.15, 1];
(3) The noise foreground image is superimposed on the original image to obtain the simulated anomalous image I_A;
I_A can be expressed as:
I_A = (1 − M) ⊙ I + I_n'
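A simplified sketch of the anomaly-simulation blending described above; a random binary mask stands in for the Perlin-noise/target-foreground mask M, and the opacity δ is sampled from [0.15, 1] as stated, so the exact masks differ from the invention's.

```python
import numpy as np

def simulate_anomaly(image, noise_image, rng):
    # image, noise_image: float arrays in [0, 1] with shape (H, W, 3)
    h, w, _ = image.shape
    # stand-in for the Perlin-noise / target-foreground mask M (binary, same spatial size)
    mask = (rng.random((h, w, 1)) > 0.85).astype(image.dtype)
    delta = rng.uniform(0.15, 1.0)                     # opacity sampled from [0.15, 1]
    noise_fg = delta * (mask * noise_image) + (1 - delta) * (mask * image)
    anomalous = (1 - mask) * image + noise_fg          # superimpose noise foreground on the original
    return anomalous, mask

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
noise = rng.random((256, 256, 3))
sim, m = simulate_anomaly(img, noise, rng)
print(sim.shape, m.mean())
```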
MemSeg further comprises a memory module and spatial attention maps. The memory module records the features of defect-free images extracted by the pre-trained encoder as memory information MI. The features of the input image are combined to form the image information II, the L2 distance between II and every piece of memory information MI is then calculated, and N pieces of difference information DI between the input image and the memory samples are obtained:
DI_i = ‖II − MI_i‖₂
Among the N pieces of difference information, the best difference information DI* between II and MI is selected by taking the minimum sum of all elements of each DI as the criterion, which can be expressed as:
DI* = arg min_{DI_i} Σ DI_i
Finally, multi-scale feature fusion is adopted, and the fused features flow to the decoder through the skip connections of U-Net.
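The memory comparison can be sketched as follows: a small bank of defect-free feature maps MI is compared with the input features II by element-wise L2 distance, and the memory item whose summed difference is smallest yields DI*; the shapes and bank size are illustrative assumptions.

```python
import numpy as np

def best_difference_info(II, MI_bank):
    # II: (C, H, W) features of the input image; MI_bank: (N, C, H, W) memorized defect-free features
    DI = np.sqrt((MI_bank - II[None]) ** 2)           # element-wise L2 difference per memory item
    sums = DI.reshape(DI.shape[0], -1).sum(axis=1)    # sum of all elements of each DI
    return DI[int(np.argmin(sums))]                   # DI*: item with the minimum summed difference

rng = np.random.default_rng(1)
II = rng.random((64, 16, 16))
MI = rng.random((30, 64, 16, 16))
print(best_difference_info(II, MI).shape)  # (64, 16, 16)
```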
MemSeg also contains spatial attention maps: three spatial attention maps are derived from DI* to strengthen the estimate of the anomalous regions. For the three features of different scales in DI*, the average over the channels is computed, yielding three feature maps of sizes 16×16, 32×32 and 64×64 respectively. The 16×16 feature map is used directly as the spatial attention map M_3, which can be expressed as:
M_3 = (1/C_3)·Σ_{i=1}^{C_3} DI*_{3,i}
After M_3 is up-sampled, an element-wise multiplication with the 32×32 feature map is performed to obtain M_2, which can be expressed as:
M_2 = ((1/C_2)·Σ_{i=1}^{C_2} DI*_{2,i}) ⊙ M_3^U
After M_2 is up-sampled, an element-wise multiplication with the 64×64 feature map is performed to obtain M_1. The spatial attention maps M_1, M_2 and M_3 are used to weight the information obtained at C_1, C_2 and C_3 respectively.
M_1 can be expressed as:
M_1 = ((1/C_1)·Σ_{i=1}^{C_1} DI*_{1,i}) ⊙ M_2^U
where C_1, C_2 and C_3 denote the numbers of channels, DI*_{1,i}, DI*_{2,i} and DI*_{3,i} denote the feature map of channel i at the corresponding scale, and M_2^U and M_3^U denote the feature maps M_2 and M_3 after up-sampling.
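A sketch of the channel averaging and up-sampling that produces the spatial attention maps M_3, M_2 and M_1; nearest-neighbour up-sampling is used here for simplicity and the channel counts are illustrative, while the 16/32/64 resolutions follow the description above.

```python
import numpy as np

def upsample2x(m):
    # nearest-neighbour 2x up-sampling of a (H, W) map
    return m.repeat(2, axis=0).repeat(2, axis=1)

def spatial_attention_maps(di3, di2, di1):
    # di3: (C3, 16, 16), di2: (C2, 32, 32), di1: (C1, 64, 64) slices of DI*
    M3 = di3.mean(axis=0)                       # channel average used directly as M3
    M2 = di2.mean(axis=0) * upsample2x(M3)      # element-wise product with up-sampled M3
    M1 = di1.mean(axis=0) * upsample2x(M2)      # element-wise product with up-sampled M2
    return M1, M2, M3

rng = np.random.default_rng(2)
M1, M2, M3 = spatial_attention_maps(rng.random((256, 16, 16)),
                                    rng.random((128, 32, 32)),
                                    rng.random((64, 64, 64)))
print(M1.shape, M2.shape, M3.shape)  # (64, 64) (32, 32) (16, 16)
```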
The MemSeg loss function uses an L1 loss and a focal loss to guarantee the similarity of all pixels in image space. The L1 loss and the focal loss can be expressed as:
L_l1 = ‖S − Ŝ‖₁
L_f = −(1 − p_t)^γ·log(p_t)
where S and Ŝ denote the ground-truth and predicted segmentation masks, p_t the predicted probability of the true class of a pixel, and γ the focusing parameter. Finally, the two constraints are combined into the objective function.
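Under the assumption that the two constraints are an L1 loss and a focal loss between the predicted and ground-truth masks (the formulation used in the public MemSeg paper), a per-pixel sketch could look like this; the weights lam1/lam2 and the focusing parameter gamma are illustrative.

```python
import numpy as np

def l1_loss(pred, target):
    return np.abs(pred - target).mean()

def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    # pred: predicted anomaly probability per pixel; target: binary ground-truth mask
    pred = np.clip(pred, eps, 1 - eps)
    pt = np.where(target == 1, pred, 1 - pred)        # probability of the true class
    return (-((1 - pt) ** gamma) * np.log(pt)).mean()

def combined_objective(pred, target, lam1=0.6, lam2=0.4):
    # weighted combination of the two constraints into one objective
    return lam1 * l1_loss(pred, target) + lam2 * focal_loss(pred, target)

rng = np.random.default_rng(3)
pred = rng.random((256, 256))
mask = (rng.random((256, 256)) > 0.9).astype(float)
print(combined_objective(pred, mask))
```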
In step five, ViT is mainly used to classify and train on the cropped small defect images. ViT is a Transformer-based image classifier whose main principle is: the input image is first divided into a number of patches, each patch is projected into a vector of fixed length and fed into the Transformer, and a special class token is added to the input sequence whose corresponding output is the final category prediction. The ViT model can be divided into the following parts:
(1) Image patching: fixed-size pictures are input and each picture is split into a number of fixed-size patches; the dimension remains unchanged after the input passes through the linear projection layer, and patch embedding converts the vision problem into a seq2seq problem;
(2) Position encoding: the position codes can be viewed as a table with N rows, where N equals the length of the input sequence; each row is a vector whose dimension equals the dimension of the input embedding, and the dimension remains unchanged after the position codes are added;
(3) Transformer encoder: the encoder is stacked from Transformer Blocks, each containing sub-modules such as a multi-head attention mechanism and a fully connected layer;
(4) MLP classification head: the output of the Transformer encoder is mapped into the class space through a fully connected layer to obtain the final classification result;
The defect contours output by the preceding steps can be classified by the ViT classifier of step five, and labeled defects with categories are finally output.
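A minimal PyTorch-style sketch of the ViT pipeline described above (patch embedding, position encoding, class token, Transformer encoder, MLP head); the patch size, depth, hidden sizes and number of classes are illustrative assumptions, not parameters fixed by the invention.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=5):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # (1) image patching + linear projection of flattened patches
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # (2) learnable position encoding, one row per sequence element
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        # (3) Transformer encoder blocks (multi-head attention + fully connected layers)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # (4) MLP classification head applied to the class token
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, N, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos          # prepend class token, add position codes
        x = self.encoder(x)
        return self.head(x[:, 0])                          # classify with the class-token output

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 5])
```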
Unlabeled defect pictures are input into the fully automatic labeling tool, and the YOLOv5 supervised model and the MemSeg semi-supervised model are used to infer and locate defects, giving the defects found by the two models. The defect center points obtained by the supervised and semi-supervised models are then de-duplicated: if the distance between any two defect center points is smaller than 10 pixels, the center point of the defect with the higher confidence is kept and the center point of the defect with the lower confidence is deleted. The center point coordinates of the defects are then used as input to the SAM model, which returns the defect contours; the defect contours are expanded to twice their size, classified by the ViT model, and the final result is generated.
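The de-duplication of defect center points from the two models can be sketched as follows: any two centers closer than 10 pixels are treated as the same defect and only the one with the higher confidence is kept; the tuple layout of the detections is an assumption made for illustration.

```python
import math

def dedup_centers(detections, min_dist=10.0):
    # detections: list of (x, y, confidence) from the YOLOv5 and MemSeg models combined
    kept = []
    for x, y, conf in sorted(detections, key=lambda d: d[2], reverse=True):
        # keep a center only if no higher-confidence center within min_dist was already kept
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky, _ in kept):
            kept.append((x, y, conf))
    return kept

dets = [(100, 100, 0.9), (105, 103, 0.6), (300, 250, 0.8)]
print(dedup_centers(dets))  # the 0.6 center is removed as a duplicate of the 0.9 one
```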
Embodiment one:
Step one: collecting defect graphs and defect-free graphs of industrial production, performing interactive labeling based on SAM, and collecting labeled defect data as a training verification set of a supervised model;
Step two: based on the marked defect pictures, dividing a training set and a verification set, dividing the data volume according to 8:2, taking 500 defect pictures in total, taking 400 pictures as training set data to train YOLOv models, carrying out model iteration for 300 rounds, taking the rest 100 pictures as the verification set, and selecting a round of model with highest verification set accuracy mAP as YOLOv supervision model, wherein mAP is average prediction accuracy MEAN AVERAGE Precision of each category, wherein the Precision can be expressed as:
precision=
recall can be expressed as:
recall=
Wherein TP indicates that the inferred defect is truly a defect, FP indicates that it is not a true defect but is judged as a defect, FN indicates that it is a true defect and is not a defect, TN indicates that it is not a true defect, and AP values can be obtained by calculating precision corresponding to each recall, which can be expressed as:
AP=
the mAP of all classes can be expressed as:
mAP
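As a sketch of the metric computation referenced in step two, the following assumes TP/FP/FN counts have already been accumulated and approximates AP as the area under a sampled precision-recall curve; the sample values are purely illustrative.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    # area under the precision-recall curve (AP); recalls assumed sorted ascending
    return float(np.trapz(precisions, recalls))

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

print(precision_recall(tp=80, fp=10, fn=20))                 # (0.888..., 0.8)
recalls = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
precisions = np.array([1.0, 0.95, 0.9, 0.7, 0.5])
print(average_precision(recalls, precisions))
print(mean_average_precision([0.82, 0.74, 0.90]))
```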
Step three: training MemSeg semi-supervised models based on 200 defect-free pictures as a training set, taking 500 defect pictures as a verification set, iterating the models for 300 rounds, and selecting a round of model with highest verification set accuracy as a MemSeg semi-supervised model;
Step four: after expanding the 1-fold outline of the defects in the marked data set, cutting according to the minimum circumscribed rectangle, sorting according to the defect types, cutting 500 defect pictures to obtain 4000 defect small pictures, training 3200 defect small pictures based on ViT models, iterating the models for 300 rounds, taking the rest 800 pictures as verification sets, and selecting a round of models with the highest accuracy of the verification sets as ViT classification models.
Embodiment two:
Step one: the semi-automatic labeling model SAM of the first example, the supervision model YOLOv, the semi-supervision model MemSeg and the classification model ViT are built;
Step two: and carrying out full-automatic labeling on the unlabeled images newly acquired by the production line. Firstly, adopting YOLOv supervision model and MemSeg semi-supervision model to perform reasoning and positioning on unlabeled pictures; then to monitor
The center points of defects obtained by the model and the semi-supervised model are de-duplicated, the distance between any two defect center points is smaller than 10 pixel points, the center points of defects with higher confidence coefficient are reserved, and the center points of defects with lower confidence coefficient are deleted; and then the central point coordinates of the defects are adopted to infer the specific outline and the size of the defects by adopting a SAM model, and finally the defects in the marked dataset are enlarged by 1 time of outline, and the defects are classified and stored by adopting a trained ViT model.
The working principle of the invention is as follows: the invention provides a fully automatic labeling method that greatly improves labeling efficiency while guaranteeing labeling accuracy. A supervised YOLOv5 model and a semi-supervised MemSeg model are used for training and inference; combining supervision and semi-supervision makes defect localization more accurate and reduces missed defects compared with manual labeling. Initially, 500 defect images are used to train the supervised model and 200 defect-free images to train the semi-supervised model. The SAM algorithm is integrated into both the semi-automatic and fully automatic labeling processes, which accelerates semi-automatic labeling and improves the accuracy of fully automatic labeling, and a ViT classifier is adopted to improve the accuracy of defect contour classification.
Finally, it should be noted that: the foregoing description is only illustrative of the preferred embodiments of the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements or changes may be made without departing from the spirit and principles of the present invention.

Claims (8)

1. An automatic labeling method for industrial visual defects is characterized by comprising the following steps:
step one: collecting defect pictures and defect-free pictures from an industrial production site, and cutting and cleaning the data;
Step two: establishing a semi-automatic labeling model based on the SAM, adjusting the outline shape and the size of the defect through an interactive function, generating the required defect, and customizing the defect type to finish the semi-automatic labeling;
step three: training the labeled industrial defect images with a YOLOv5 model, and taking the model with the optimal result as the supervised positioning model;
step four: training a defect-free picture by adopting a semi-supervised MemSeg model, and taking a model with an optimal result as a semi-supervised positioning model;
Step five: amplifying the marked defect image by 1 time, cutting off the outline, classifying according to categories, training by adopting ViT classification models, and taking the model with the optimal result as the classification model;
Step six: transmitting the defect image data which is not subjected to semi-automatic labeling to a full-automatic labeling tool, positioning defects of the defect image through a trained YOLOv supervision model and a MemSeg semi-supervision model, removing the duplication of the defects, calculating the coordinates of a defect center point, transmitting the coordinates to a SAM model, adjusting the outline and the size through the SAM model, amplifying the outline by 1 time, and classifying the outline through a ViT classification model to generate labeled defects;
The SAM model adopts data enhancement techniques of random rotation, scaling, cropping, flipping, color space transformation and noise addition to improve the generalization ability of the model, and adopts a pre-trained image classification model as the initial weights of the encoder to extract high-quality defect features; the backbone network of SAM is based on the ViT algorithm and contains a multi-head attention mechanism, whose calculation formula can be expressed as:
SA(Q, K, V) = softmax(QKᵀ/√d_k)·V;
where Q is the query matrix, K the key matrix, V the value matrix, d_k the dimension of the keys and queries, and Kᵀ the transpose of the key matrix K; during the matrix calculation, Q, K and V are each linearly projected, giving Q_i = QW_i^Q, K_i = KW_i^K and V_i = VW_i^V, with d_k equal to the model dimension divided by the number of heads H; the attention head operation is then performed H times and the results are concatenated; the head concatenation formulas can be expressed as:
Head_i = SA(QW_i^Q, KW_i^K, VW_i^V);
O = concat(Head_1, …, Head_H);
that is, Q, K and V are linearly projected, attention is computed per head, and the head outputs are finally concatenated;
The MemSeg algorithm in step four is a semi-supervised image surface-defect detection network that detects defects on the surface of industrial products mainly by exploiting differences and commonalities between images; the MemSeg model is mainly based on a U-Net network, uses the pre-trained ResNet18 as the encoder, and introduces an anomaly-simulation strategy, a memory module, a multi-scale feature fusion module and an attention mechanism module, which improve the accuracy of the anomaly localization model;
MemSeg contains an anomaly simulation strategy, which roughly includes three steps:
(1) Generating a mask image using Perlin noise and the target foreground;
(2) Extracting the ROI defined by the mask image M from the noise image I_n to generate the noise foreground image I_n', which can be expressed as:
I_n' = δ(M ⊙ I_n) + (1 − δ)(M ⊙ I);
where I is the original image and ⊙ denotes element-wise multiplication; for the noise image, a relatively high maximum transparency is desirable in order to increase the difficulty of model learning and thereby the robustness of the model, so δ in the formula is sampled randomly and uniformly from [0.15, 1];
(3) The noise foreground image is superimposed on the original image to obtain the simulated anomalous image I_A;
I_A can be expressed as:
I_A = (1 − M) ⊙ I + I_n';
MemSeg further comprises a memory module and spatial attention maps; the memory module records the features of defect-free images extracted by the pre-trained encoder as memory information MI; the features of the input image are combined to form the image information II, the L2 distance between II and every piece of memory information MI is then calculated, and N pieces of difference information DI between the input image and the memory samples are obtained:
DI_i = ‖II − MI_i‖₂;
among the N pieces of difference information, the best difference information DI* between II and MI is selected by taking the minimum sum of all elements of each DI as the criterion, which can be expressed as:
DI* = arg min_{DI_i} Σ DI_i;
finally, multi-scale feature fusion is adopted, and the fused features flow to the decoder through the skip connections of U-Net;
MemSeg also contains spatial attention maps: three spatial attention maps are derived from DI* to strengthen the estimate of the anomalous regions; for the three features of different scales in DI*, the average over the channels is computed, yielding three feature maps of sizes 16×16, 32×32 and 64×64 respectively, and the 16×16 feature map is used directly as the spatial attention map M_3, which can be expressed as:
M_3 = (1/C_3)·Σ_{i=1}^{C_3} DI*_{3,i};
after M_3 is up-sampled, an element-wise multiplication with the 32×32 feature map is performed to obtain M_2, which can be expressed as:
M_2 = ((1/C_2)·Σ_{i=1}^{C_2} DI*_{2,i}) ⊙ M_3^U;
after M_2 is up-sampled, an element-wise multiplication with the 64×64 feature map is performed to obtain M_1, and the spatial attention maps M_1, M_2 and M_3 are used to weight the information obtained at C_1, C_2 and C_3 respectively;
M_1 can be expressed as:
M_1 = ((1/C_1)·Σ_{i=1}^{C_1} DI*_{1,i}) ⊙ M_2^U;
where C_1, C_2 and C_3 denote the numbers of channels, DI*_{1,i}, DI*_{2,i} and DI*_{3,i} denote the feature map of channel i at the corresponding scale, and M_2^U and M_3^U denote the feature maps M_2 and M_3 obtained after up-sampling;
the MemSeg loss function uses an L1 loss and a focal loss to guarantee the similarity of all pixels in image space; the L1 loss and the focal loss can be expressed as:
L_l1 = ‖S − Ŝ‖₁;
L_f = −(1 − p_t)^γ·log(p_t);
where S and Ŝ denote the ground-truth and predicted segmentation masks, p_t the predicted probability of the true class of a pixel, and γ the focusing parameter; finally, the two constraints are combined into an objective function.
2. The automatic labeling method for industrial visual defects according to claim 1, wherein the SAM model in step two comprises a network structure, a loss function, data enhancement and a pre-trained model; the network structure adopts an encoder-decoder architecture similar to a U-Net network; the encoder, composed of convolution layers and pooling layers, extracts the defect image features, and the decoder, composed of several deconvolution and up-convolution layers, restores the transformed feature map to the target size and outputs the final segmentation result.
3. The automatic labeling method for industrial visual defects according to claim 2, wherein the loss function of the SAM algorithm is a cross-entropy-based multi-task loss comprising a pixel-level classification loss, used to judge whether each pixel belongs to a defect, and a regression loss, used to measure the difference between the real defect contour and the predicted defect contour.
4. The automatic labeling method for industrial visual defects according to claim 1, wherein the YOLOv5 network in step three is a single-stage detection algorithm consisting mainly of an input stage, a backbone network, a neck network and a head network;
the input stage of the YOLOv5 model adopts Mosaic data enhancement, adaptive anchor frame calculation and adaptive picture scaling;
the backbone network mainly comprises a Focus structure, whose main purpose is to slice the picture; the backbone network also comprises a CSP structure, which splits the feature map into two parts, performs feature extraction on one part with convolution layers, and fuses the other part with that result by concatenating channels.
5. The automatic labeling method for industrial visual defects according to claim 4, wherein the neck network of YOLOv5 in step three adopts an FPN plus PAN structure, the FPN layers conveying strong semantic information top-down and the PAN conveying localization features bottom-up;
the head network applies 1×1 convolutions to the feature maps of different scales obtained from the neck network; it comprises 3 detection layers corresponding to the 3 feature maps of different scales from the neck, and for each the network sets 3 anchors with different aspect ratios to predict and regress targets; the regression formulas of a target frame can be expressed as:
b_x = 2σ(t_x) − 0.5 + c_x;
b_y = 2σ(t_y) − 0.5 + c_y;
b_w = p_w·(2σ(t_w))²;
b_h = p_h·(2σ(t_h))²;
where σ(·) is the sigmoid function, (b_x, b_y) are the x-axis and y-axis center coordinates of the generated prediction frame, (b_w, b_h) are the width and height of the prediction frame, (c_x, c_y) are the upper-left corner coordinates of the grid cell containing the prediction frame's center point, (t_x, t_y) are the offsets of the center point relative to that corner, (t_w, t_h) are the scalings of the prediction frame's width and height relative to the anchor frame, and (p_w, p_h) are the width and height of the anchor frame; YOLOv5 adopts an aspect-ratio matching strategy, whose main flow is:
(1) The width ratios (w1/w2, w2/w1) and height ratios (h1/h2, h2/h1) between the manually labeled box and the 9 different anchor frames are calculated;
(2) The maximum value among (w1/w2, w2/w1) and (h1/h2, h2/h1) is taken as the ratio between the manually labeled box and the anchor frame;
(3) If the ratio obtained in step (2) is smaller than the set threshold, that anchor frame is responsible for predicting the real labeled frame, the prediction frame regressed from it is a positive sample, and the remaining prediction frames are negative samples.
6. The automatic labeling method for industrial visual defects according to claim 5, wherein the YOLOv5 model in step three involves several loss functions, whose role is to measure the difference between the predicted results and the true labeled results; the total loss of YOLOv5 can be expressed as:
Loss = box_gain×bbox_loss + cls_gain×cls_loss + obj_gain×obj_loss;
where box_gain, cls_gain and obj_gain are the weights of the different losses, with defaults of 0.05, 0.5 and 1.0 respectively; bbox_loss denotes the rectangular box loss, cls_loss the classification loss and obj_loss the confidence loss; the rectangular box loss is measured on the basis of IoU, which represents the overlap between the prediction frame and the manually labeled real frame in target detection; denoting the prediction frame by A and the real frame by B, IoU can be expressed as:
IoU = |A ∩ B| / |A ∪ B|;
when the prediction frame and the real frame do not intersect, IoU cannot reflect the relationship between them, which affects gradient back-propagation and prevents training; moreover, IoU cannot accurately reflect how well the two frames coincide; YOLOv5 therefore uses CIoU by default to calculate the bounding-box loss; CIoU builds on DIoU and further considers the aspect ratio of the anchor frame; the calculation formula of DIoU is:
L_DIoU = 1 − IoU + ρ²(b, b^gt)/c²;
where b and b^gt denote the center points of the prediction frame and the real frame respectively, ρ denotes the Euclidean distance between the two center points, and c denotes the diagonal length of the smallest enclosing region of the prediction and real frames; CIoU adds an influence factor αv on the basis of DIoU, where α is a weight parameter that can be expressed as:
α = v / ((1 − IoU) + v);
v measures the consistency of the aspect ratios, which can be expressed as:
v = (4/π²)·(arctan(w^gt/h^gt) − arctan(w/h))²;
CIoU can be expressed as:
L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv;
YOLOv5 by default uses a binary cross-entropy function to calculate the classification loss, which can be expressed as:
L = −y·log(p) − (1 − y)·log(1 − p);
where y is the label corresponding to the input sample and p is the probability, predicted by the model, that the input sample is a positive sample.
7. The automatic labeling method for industrial visual defects according to claim 1, wherein the ViT model in step five is mainly used to classify and train on the cropped small defect images, the ViT model is a Transformer-based image classifier, and the ViT model can be divided into the following parts:
(1) Image patching: fixed-size pictures are input and each picture is split into a number of fixed-size patches; the dimension remains unchanged after the input passes through the linear projection layer, and patch embedding converts the vision problem into a seq2seq problem;
(2) Position encoding: the position codes can be viewed as a table with N rows, where N equals the length of the input sequence; each row is a vector whose dimension equals the dimension of the input embedding, and the dimension remains unchanged after the position codes are added;
(3) Transformer encoder: the encoder is stacked from Transformer Blocks, each containing sub-modules such as a multi-head attention mechanism and a fully connected layer;
(4) MLP classification head: the output of the Transformer encoder is mapped into the class space through a fully connected layer to obtain the final classification result.
8. The automatic labeling method for industrial visual defects according to claim 1, wherein the specific implementation of step six is:
first, the unlabeled defect pictures are input into the fully automatic labeling tool, and the YOLOv5 supervised model and the MemSeg semi-supervised model are used to infer and locate defects, giving the defects found by the two models;
then the defect center points obtained by the supervised and semi-supervised models are de-duplicated: if the distance between any two defect center points is smaller than 10 pixels, the center point of the defect with the higher confidence is kept and the center point of the defect with the lower confidence is deleted;
then the center point coordinates of the defects are used as input to the SAM model, which returns the defect contours; the defect contours are finally expanded to twice their size and classified by the ViT model to generate the final result; this process requires no human intervention and can automatically label a large number of defects.
CN202410258471.6A 2024-03-07 2024-03-07 Automatic labeling method for industrial visual defects Active CN117854072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410258471.6A CN117854072B (en) 2024-03-07 2024-03-07 Automatic labeling method for industrial visual defects

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410258471.6A CN117854072B (en) 2024-03-07 2024-03-07 Automatic labeling method for industrial visual defects

Publications (2)

Publication Number Publication Date
CN117854072A CN117854072A (en) 2024-04-09
CN117854072B true CN117854072B (en) 2024-05-07

Family

ID=90529136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410258471.6A Active CN117854072B (en) 2024-03-07 2024-03-07 Automatic labeling method for industrial visual defects

Country Status (1)

Country Link
CN (1) CN117854072B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309502A (en) * 2023-03-27 2023-06-23 江苏科技大学 Ship coating defect detection method based on improved attention module
CN116994047A (en) * 2023-08-01 2023-11-03 北京工商大学 Small sample image defect target detection method based on self-supervision pre-training
CN117095394A (en) * 2023-08-28 2023-11-21 国网浙江省电力有限公司电力科学研究院 SAM-based semi-automatic labeling method for defects of power transformation equipment
CN117409411A (en) * 2023-11-04 2024-01-16 福州大学 TFT-LCD liquid crystal panel defect segmentation method and system based on semi-supervised learning
CN117437227A (en) * 2023-12-20 2024-01-23 成都数之联科技股份有限公司 Image generation and defect detection method, device, medium, equipment and product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717354B (en) * 2018-07-11 2023-05-12 哈尔滨工业大学 Super-pixel classification method based on semi-supervised K-SVD and multi-scale sparse representation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309502A (en) * 2023-03-27 2023-06-23 江苏科技大学 Ship coating defect detection method based on improved attention module
CN116994047A (en) * 2023-08-01 2023-11-03 北京工商大学 Small sample image defect target detection method based on self-supervision pre-training
CN117095394A (en) * 2023-08-28 2023-11-21 国网浙江省电力有限公司电力科学研究院 SAM-based semi-automatic labeling method for defects of power transformation equipment
CN117409411A (en) * 2023-11-04 2024-01-16 福州大学 TFT-LCD liquid crystal panel defect segmentation method and system based on semi-supervised learning
CN117437227A (en) * 2023-12-20 2024-01-23 成都数之联科技股份有限公司 Image generation and defect detection method, device, medium, equipment and product

Also Published As

Publication number Publication date
CN117854072A (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN110287960B (en) Method for detecting and identifying curve characters in natural scene image
CN111223088B (en) Casting surface defect identification method based on deep convolutional neural network
CN112967243A (en) Deep learning chip packaging crack defect detection method based on YOLO
CN113240626B (en) Glass cover plate concave-convex type flaw detection and classification method based on neural network
CN109840483B (en) Landslide crack detection and identification method and device
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN112085024A (en) Tank surface character recognition method
CN112307919B (en) Improved YOLOv 3-based digital information area identification method in document image
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN111783819B (en) Improved target detection method based on region of interest training on small-scale data set
CN113610822A (en) Surface defect detection method based on multi-scale information fusion
CN113221956B (en) Target identification method and device based on improved multi-scale depth model
CN115841447A (en) Detection method for surface defects of magnetic shoe
CN113297956B (en) Gesture recognition method and system based on vision
CN116645592B (en) Crack detection method based on image processing and storage medium
CN110390673A (en) Cigarette automatic testing method based on deep learning under a kind of monitoring scene
CN115880495A (en) Ship image target detection method and system under complex environment
CN112926694A (en) Method for automatically identifying pigs in image based on improved neural network
CN117197146A (en) Automatic identification method for internal defects of castings
CN117854072B (en) Automatic labeling method for industrial visual defects
CN115937095A (en) Printing defect detection method and system integrating image processing algorithm and deep learning
CN113313678A (en) Automatic sperm morphology analysis method based on multi-scale feature fusion
CN110659694A (en) Method for detecting citrus fruit base based on machine learning
CN117372732A (en) Image target detection method based on deep learning
CN117710208A (en) Self-training system of self-adaptive model and method for self-training self-adaptive model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant