CN116994047A - Small sample image defect target detection method based on self-supervision pre-training - Google Patents

Small sample image defect target detection method based on self-supervision pre-training

Info

Publication number
CN116994047A
CN116994047A (application CN202310955804.6A)
Authority
CN
China
Prior art keywords
training, small sample, target detection, network, self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310955804.6A
Other languages
Chinese (zh)
Inventor
洪兆瑞
于重重
仇宁海
赵霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Lingtong Huizhi Technology Co ltd
Beijing Technology and Business University
Original Assignee
Nanjing Lingtong Huizhi Technology Co ltd
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Lingtong Huizhi Technology Co ltd, Beijing Technology and Business University filed Critical Nanjing Lingtong Huizhi Technology Co ltd
Priority to CN202310955804.6A priority Critical patent/CN116994047A/en
Publication of CN116994047A publication Critical patent/CN116994047A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/0895 - Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/096 - Transfer learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 - Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample image defect target detection method based on self-supervision pre-training, which adopts the transfer learning paradigm of small sample target detection, namely pre-training on base-class samples and fine-tuning on novel-class samples. A large number of base-class normal picture samples are used for base-class pre-training; the trained network model is then fine-tuned on a small number of novel-class defect picture samples to continue training; finally, testing is carried out on the test picture set to detect the novel classes, namely the defect classes. The technical scheme of the invention has strong robustness and generalization ability in small sample target detection scenarios and can improve the detection precision of small sample image target detection. The method can be applied to high-speed rail infrastructure image processing and target detection.

Description

Small sample image defect target detection method based on self-supervision pre-training
Technical Field
The invention relates to a deep learning self-supervised pre-training image processing method and a deep learning small sample image target detection method, in particular to a small sample image defect target detection method based on self-supervised pre-training with an added multi-scale attention mechanism and context semantic fusion module. The method can be applied to high-speed rail infrastructure image processing and target detection and belongs to the technical field of computer vision.
Background
High-speed railway infrastructure is essential for guaranteeing the running safety of high-speed trains, so periodic inspection and defect detection of this infrastructure is of great significance for maintaining stable operation. In recent years, with the rapid development of deep learning, more and more deep learning-based models have been applied to railway infrastructure defect detection. Wei et al. proposed an intelligent method for online detection of the state of pantograph slide plates based on deep learning and image processing. Ye et al. then proposed a target detection method based on a differential feature fusion convolutional neural network for the railway target detection problem. In addition, Liu et al. proposed a high-speed railway support sleeve screw detection method based on an improved Faster R-CNN. Although existing deep learning models have achieved good results in railway infrastructure defect detection, these methods all require a large number of labeled samples. However, many defect samples in high-speed rail infrastructure are not easily found and collected, such as loosened or missing tower bolts, missing bridge steel-structure bolts and missing rail fasteners along the track; at the same time, high-speed rail infrastructure defects are often small targets that are difficult to detect. It is therefore difficult for conventional models that rely on a large number of labeled samples to detect these defects in the small sample setting.
Disclosure of Invention
In order to solve the problems of the prior art, the invention realizes a small sample image target detection method based on self-supervised pre-training for detecting defects of high-speed rail infrastructure, with an added multi-scale attention mechanism and context semantic fusion module that improve the detection precision of small sample image target detection. The method adopts the transfer learning paradigm of small sample target detection, namely pre-training on base-class samples and fine-tuning (i.e., continued training) on novel-class samples.
The technical scheme provided by the invention is as follows:
a method for detecting a small sample image defect target based on self-supervision pre-training comprises the following steps:
1) Performing self-supervision pre-training to construct a small sample target detection network model; training a backbone network in the small sample target detection network model by adopting a SlotCon self-supervision pre-training method to obtain a backbone network weight after self-supervision pre-training, wherein the backbone network weight is used as an initial weight of the backbone network of the small sample target detection network model;
the method comprises the steps of adopting a SlotCon self-supervision pre-training method to combine with small sample target detection, using a large amount of label-free high-altitude iron infrastructure data to perform pre-training on the self-supervision method SlotCon, and replacing the backbone network weight after the self-supervision pre-training with the network weight after the supervision pre-training as the initial weight of a backbone network of a small sample target detection network model constructed by the method;
the small sample target detection network model constructed by the invention comprises a backbone network, a gradient decoupling layer (GDL layer), a region proposal network RPN, a region-of-interest pooling structure RoIPooling, a context semantic fusion module, a classifier and a frame regressor;
in specific implementation, the invention adopts a SlotCon self-supervision method to train the backbone network ResNet101, and performs contrast learning from a data-driven semantic slot (slots) for combining semantic grouping and representation learning (effective features are automatically learned through an algorithm, and the performance of a model is improved). Semantic grouping is achieved by assigning pixels to a set of learnable prototypes that can focus on features to fit each sample and form a new slot. Based on the learned data dependent slots, the comparison target is adopted for representation learning, so that the feature resolvability is enhanced. And replacing the trained ResNet101 backbone network weight with the supervised pre-training weight, and carrying out subsequent target detection.
2) The invention adopts the transfer learning paradigm for small sample target detection training: first, a large number of base-class normal samples are used for pre-training (base-class pre-training); the network model after base-class pre-training is then fine-tuned (training continues) on a small number of novel-class defect samples; finally, testing is carried out on the test set to detect the novel classes (the defect classes). The same network model structure is used in both the pre-training and fine-tuning stages. Specifically, a high-speed rail infrastructure picture is fed into the small sample target detection network model. It first passes through the backbone network, which extracts the features of the image and is composed of the residual network ResNet101 and the feature pyramid FPN; a SENet (squeeze-and-excitation network) attention mechanism is added to the last layer of the residual network to form a SENet-based multi-scale attention mechanism (SE-MAM, multi-scale attention mechanism based on squeeze and excitation network). The feature map produced by the backbone is sent to the GDL layer for forward propagation, and the resulting output is fed to the region proposal network RPN and the region-of-interest pooling structure RoIPooling. The RPN provides regression boxes that may contain targets and generates proposal feature vectors with target scores and bounding-box regression offsets; RoIPooling uses pooling to obtain a fixed-size output feature map from regions of interest of different sizes in the input feature map;
3) The output feature map obtained after RoIPooling is passed through a context semantic fusion module (CSF) to a classifier and a frame regressor. The classifier computes, through a softmax function, the probability that the object in a candidate box belongs to each category; the final predicted category is the one with the largest output probability. The frame regressor computes the loss of the offset between the predicted and true values with the Smooth L1 loss function, and the original candidate box is corrected by the offset to obtain the final predicted box coordinates, i.e., the position of the identified target, thereby realizing small sample target detection.
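As a minimal illustration of the two heads described in step 3), the following sketch shows softmax class prediction and the Smooth L1 offset loss in PyTorch; the helper names and the `beta` threshold are assumptions, not the patent's implementation:

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for box-offset regression: quadratic for small
    offsets (below beta), linear for large ones."""
    diff = (pred - target).abs()
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()

def predict_class(logits):
    """Softmax over class logits; the predicted class is the one with
    the largest output probability."""
    probs = logits.softmax(dim=-1)
    return probs, probs.argmax(dim=-1)
```

The piecewise form keeps gradients bounded for large localization errors, which is why it is the standard choice for box regression in Faster R-CNN-style detectors.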
In order to solve the problem that the features provided by supervised pre-training are not targeted at the complex railway background, the invention uses the self-supervised pre-training method SlotCon, and the weights obtained by self-supervised pre-training serve as the backbone weights of the small sample detector. In order to improve the model's ability to recognize small targets and its sensitivity to channel features, the invention proposes a multi-scale attention mechanism comprising a feature pyramid (FPN) and a SENet attention mechanism; the feature map output by the multi-scale attention network enters a gradient decoupling layer (GDL), which adjusts the degree of decoupling between different modules. In forward propagation, the feature representation is simply enhanced with an affine transformation layer A; in backward propagation, the GDL takes the gradient from the subsequent layer, multiplies it by a coefficient λ ∈ [0, 1], and passes it to the previous layer. The GDL can be regarded as a pseudo-function G_(A,λ) defined by two equations that describe its forward and backward propagation behavior:
G_(A,λ)(x) = A(x)   (1)
∂G_(A,λ)(x)/∂x = λ · ∂A(x)/∂x   (2)
wherein ∂A(x)/∂x is the Jacobian matrix of the affine transformation layer; x is the input feature map; and A(x) denotes feeding the feature map through the affine transformation layer.
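A gradient decoupling layer of this kind can be sketched in PyTorch with a custom autograd function; this is an illustrative sketch assuming a channel-wise affine transform for A, and the class names are hypothetical:

```python
import torch

class _GradScale(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by lam in the
    backward pass, decoupling this branch from earlier layers."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.lam, None

class GDL(torch.nn.Module):
    """Gradient decoupling layer: forward applies a channel-wise affine
    transform A(x); backward passes lambda times the gradient upstream."""
    def __init__(self, channels, lam=0.1):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = torch.nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.lam = lam

    def forward(self, x):
        x = _GradScale.apply(x, self.lam)   # attenuate gradient to earlier layers
        return x * self.weight + self.bias  # affine transform A(x)
```

With λ = 0 the detection heads stop influencing the backbone entirely; intermediate values trade off how strongly the RPN and RoI heads are allowed to rewrite the pre-trained features.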
The context semantic fusion module (CSF) fuses features of different scales, learning good global and local features simultaneously. Its output is sent to the classifier and the frame regressor: the classifier yields the predicted category score and the frame regressor the predicted coordinates, realizing small sample target detection. The experimental results show that the method has strong robustness and generalization ability in small sample target detection scenarios, and its detection performance surpasses that of SOTA (state-of-the-art, currently best-performing) small sample target detection models.
Drawings
Fig. 1 is a schematic diagram of a self-supervision method SlotCon framework used in the present invention.
FIG. 2 is a schematic diagram of a network structure of a small sample image defect target detection model based on a multi-scale attention mechanism and a context semantic fusion module constructed in the present invention.
Detailed Description
The model structure of the invention comprises: (1) a backbone network part, comprising a residual network structure, a feature pyramid structure (FPN) and a SENet attention mechanism; (2) an RPN and RoIPooling with a gradient decoupling layer; (3) a context semantic fusion module, whose output is sent to a classifier and a frame regressor, where the classifier yields the predicted category score and the frame regressor the predicted coordinates.
The self-supervised pre-training method SlotCon used in the invention is shown in Fig. 1. For an unlabeled image dataset D, a group of prototypes S is obtained through self-supervised pre-training with SlotCon to classify the pixels in the images, so that pixels assigned to the same prototype have similar feature representations; SlotCon obtains the prototypes S with a pixel-level deep clustering method. Specifically, SlotCon consists of a student network and a teacher network with the same structure but different parameters. The student network comprises an encoder f_θ, a projector g_θ and K learnable prototypes S_θ; the teacher network's weights ξ are updated as an exponential moving average of the student's. Given an input image x, two augmented views v_l ∈ {v_1, v_2} are generated with two random augmentation methods and passed through the encoders of the student and teacher networks to output feature maps (with height, width and channel dimensions), which are then passed through a multi-layer perceptron (MLP) projector to obtain projected features (again with height, width and channel dimensions). The prototypes S_θ are then used to compute assignments of the student's features, which are matched against the assignments generated by the teacher network on the other view v_l'. At the pixel level, for the overlapping regions of the two views, the assignment of identical pixels to prototypes is kept consistent at each location; at the object level, semantically identical pixels on the feature map are clustered together, and contrastive learning is performed on the semantic slots between different views. Semantic grouping and representation learning thus promote and optimize each other in both directions.
In the self-supervised pre-training stage, the SlotCon method is used to train the backbone network ResNet101, performing contrastive learning from data-driven semantic slots for joint semantic grouping and representation learning. Semantic grouping is achieved by assigning pixels to a set of learnable prototypes that adapt to each sample and form the slots. Based on the learned data-dependent slots, a contrastive objective is adopted for representation learning, enhancing feature discriminability. The trained ResNet101 backbone weights replace the supervised pre-training weights for subsequent target detection.
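As a rough illustration of the assignment step above, pixel features can be softly assigned to the K learnable prototypes with a temperature-scaled softmax over cosine similarities; the function name and temperature value are assumptions, and SlotCon's full objective has further components (EMA teacher, slot-level contrast) not shown here:

```python
import torch
import torch.nn.functional as F

def prototype_assignments(feats, prototypes, tau=0.07):
    """Soft assignment of pixel features to prototypes.

    feats: (N, D) pixel features (N = H*W); prototypes: (K, D).
    Returns an (N, K) matrix whose rows sum to 1.
    """
    feats = F.normalize(feats, dim=-1)        # l2-normalize pixel features
    protos = F.normalize(prototypes, dim=-1)  # l2-normalize prototypes
    logits = feats @ protos.t() / tau         # scaled cosine similarities
    return logits.softmax(dim=-1)
```

Consistency between the student's assignments on one view and the teacher's assignments on the other view is what drives the semantic grouping described above.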
As shown in Fig. 2, the backbone network of the invention addresses the small defect targets with a multi-scale feature pyramid, whose structure comprises a bottom-up pathway, a top-down pathway and lateral connections. In the bottom-up stage a typical CNN is used: feature maps of different sizes are obtained through a series of convolution operations; feature maps of the same size belong to one stage, and the last-layer output of each stage is taken as the extracted feature, so that a feature pyramid is formed. Four stages are generated from bottom to top, namely conv2 (C2), conv3 (C3), conv4 (C4) and conv5 (C5), whose output feature maps after the series of convolutions have sizes 56×56×256, 28×28×512, 14×14×1024 and 7×7×2048 respectively. In the top-down stage, the higher-level feature maps are upsampled by a factor of 2 using nearest-neighbor interpolation, ensuring that after upsampling the height and width match so that the addition fusion of the lateral connections can be performed. The upsampled features are then fused with the features of the previous layer through lateral connections, strengthening the higher-level features; a 1×1 convolution kernel is used in each lateral connection mainly to adjust the number of output channels of the different feature layers to 256, so that the upsampled features can be added. Each lateral connection merges feature maps of the same spatial size on the bottom-up and top-down pathways. After fusion, each result is convolved with a 3×3 kernel to eliminate the aliasing effect of upsampling.
Suppose the generated feature maps are P2, P3, P4 and P5, corresponding one-to-one to the original bottom-up convolution results C2, C3, C4 and C5.
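The top-down pathway above can be sketched in PyTorch as follows; this is a sketch under the stated dimensions (C2..C5 with 256/512/1024/2048 channels, 256 output channels, nearest-neighbor 2× upsampling, 1×1 lateral and 3×3 smoothing convolutions), not the invention's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """FPN top-down pathway with lateral connections: 1x1 convs bring all
    levels to 256 channels, nearest-neighbor 2x upsampling merges higher
    levels into lower ones, and 3x3 convs smooth the fused maps."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels
        )

    def forward(self, c2, c3, c4, c5):
        feats = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # top-down: upsample the coarser map by 2 and add the lateral map
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(feats[i + 1], scale_factor=2, mode="nearest")
        # 3x3 smoothing reduces upsampling aliasing -> P2..P5
        return [sm(f) for sm, f in zip(self.smooth, feats)]
```

Each output level keeps its input's spatial size while all levels share a 256-channel representation, which is what allows the element-wise additions.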
SENet explicitly models channel interdependencies with squeeze-and-excitation blocks to enhance the learning of convolutional features, increasing the network's sensitivity to information that can be exploited by subsequent transformations. Because each convolution kernel operates on a local receptive field, each unit of the output U cannot exploit contextual information outside its region; to alleviate this, global average pooling is used to compress the global spatial information into the channel dimension. Formally, a statistic z ∈ R^C is generated by shrinking U over its spatial dimensions H×W, and the c-th element of z is computed as:
z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)   (3)
To capture channel dependencies using the information aggregated by the squeeze operation, a second, excitation operation uses a simple gating mechanism with a sigmoid activation function:
s = F_ex(z, W) = σ(W_2 δ(W_1 z))   (4)
where δ is the ReLU activation function, W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)}. To reduce model complexity and improve generalization, a bottleneck of two fully connected layers (FCs) is adopted: the first FC layer reduces the dimension by a factor r, a ReLU activation follows, and the final FC layer restores the original dimension. The learned per-channel activations are multiplied with the original features on U to obtain the final output:
x̃_c = F_scale(u_c, s_c) = s_c · u_c   (5)
wherein F_scale(u_c, s_c) is the channel-wise product between the scalar s_c and the feature map u_c ∈ R^{H×W}.
After SENet is added to the last stage of ResNet101, the SE-MAM module does not damage the complete structure of the residual network, and the feature layer of each scale gains sensitivity to channel features.
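The squeeze, excitation and scale operations above correspond to the standard squeeze-and-excitation block, which can be sketched as follows (a reduction ratio of r = 16 is assumed, as is common for SENet):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pool (squeeze), a two-FC
    bottleneck with reduction ratio r and a sigmoid gate (excitation),
    then channel-wise rescaling of the input features."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, u):
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                # squeeze: eq. (3)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation: eq. (4)
        return u * s.view(b, c, 1, 1)                         # scale: eq. (5)
```

Because the gate values lie in (0, 1), the block can only attenuate channels, re-weighting them without changing the feature map's shape, which is why it can be appended to a residual stage without damaging its structure.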
As shown in Fig. 2, the invention uses a context semantic fusion module (CSF) to solve the information loss easily caused by a single pooling operation during training. After RoIPooling, pooling at three different scales is introduced and the results are semantically fused to better capture global and local features. Specifically, instead of a fixed resolution, the context semantic fusion module selects three resolutions of 6, 12 and 18 for parallel pooling operations, obtaining a more comprehensive feature representation: the large resolution attends more to global information and the small resolution more to local information, so that both are better exploited to detect objects. After pooling, the features at each resolution are semantically fused through two branches: the first branch comprises a fully connected layer; the second comprises a global average pooling layer, a fully connected layer and an upsampling layer. The fused features at the three resolutions are merged and finally upsampled with a 1×1 convolution to restore the original size of the feature map for output.
The output of the CSF module is passed through two fully connected layers for further feature extraction; finally a frame regressor and a classifier perform bounding-box regression and class prediction, realizing the detection of small sample image defect targets based on self-supervised pre-training.
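The CSF description leaves implementation details open; the following is one hypothetical sketch, assuming adaptive average pooling at resolutions 6/12/18, a 1×1 convolution standing in for each branch's fully connected layer, and nearest-neighbor upsampling back to the RoI size. All layer choices here are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextSemanticFusion(nn.Module):
    """Hypothetical CSF sketch: parallel adaptive pooling at three
    resolutions; per resolution, a local branch (1x1 conv) and a global
    branch (global average pool + 1x1 conv + upsample) are fused; the
    three fused maps are merged by a final 1x1 convolution."""
    def __init__(self, channels=256, sizes=(6, 12, 18)):
        super().__init__()
        self.sizes = sizes
        self.local = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in sizes)
        self.glob = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in sizes)
        self.merge = nn.Conv2d(channels * len(sizes), channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        fused = []
        for size, loc, glo in zip(self.sizes, self.local, self.glob):
            p = F.adaptive_avg_pool2d(x, size)
            g = F.interpolate(glo(F.adaptive_avg_pool2d(p, 1)), size=(size, size))
            fused.append(F.interpolate(loc(p) + g, size=(h, w), mode="nearest"))
        # merge the three resolutions and restore the channel count
        return self.merge(torch.cat(fused, dim=1))
```

The module is shape-preserving, so it can be dropped between RoIPooling and the detection heads without changing the rest of the pipeline.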
When the method is implemented, image datasets captured by an unmanned aerial vehicle are used for implementation and evaluation, as further described below:
1. Experiments were performed on the collected high-speed rail infrastructure dataset captured by unmanned aerial vehicles. As shown in Table 1, the collected aerial image dataset contains 16 categories of high-speed rail infrastructure with 5091 images. The images are first divided into a training image set and a test image set at a ratio of 1.2:1; the categories of the dataset are then divided into base classes and novel classes following the dataset partitioning method of small sample target detection, with the normal categories as base classes and the defect categories as novel classes. According to this partitioning, k = 1, 2, 3, 5, 10 samples are randomly selected per novel class as fine-tuning samples.
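The per-class k-shot fine-tuning split can be illustrated with a small helper; the function name, data layout and fixed seed are assumptions, since the patent does not specify the sampling code:

```python
import random

def sample_k_shot(images_by_class, k, seed=0):
    """Randomly select k fine-tuning images for each novel class.

    images_by_class: dict mapping class name -> list of image ids.
    Returns a dict with exactly k ids per class."""
    rng = random.Random(seed)
    return {cls: rng.sample(imgs, k) for cls, imgs in images_by_class.items()}
```

Fixing the seed keeps the k-shot split reproducible across the k = 1, 2, 3, 5, 10 settings.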
In the experiments, average precision (AP) is selected as the evaluation index to compare the detection performance of the different algorithms. mAP50 is the mean of AP50 over all classes, and AP50 is the AP computed with an IoU threshold of 0.5, where IoU (intersection-over-union) measures the degree of overlap between an object detected by the model and the real object.
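The IoU underlying the AP50 metric can be computed as follows, with boxes given as (x1, y1, x2, y2); at AP50 a detection counts as correct when its IoU with a ground-truth box is at least 0.5:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

For example, two unit-overlap 2×2 boxes give IoU 1/7 ≈ 0.14, which would not count as a correct detection at the 0.5 threshold.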
Table 1 data distribution of unmanned aerial vehicle data set
The small sample image target detection model is trained end-to-end with a stochastic gradient descent algorithm, where in each iteration a small batch of samples is randomly selected to compute the gradient of the loss function and the parameters are updated with this gradient. The experimental setup is shown in Table 2:
table 2 experimental parameter settings
The experimental environment is shown in table 3.
TABLE 3 Experimental Environment
The numbers of iterations of the base-class pre-training model and of the novel-class fine-tuning model are shown in Table 4:
table 4 model cases during training and fine tuning
2. First, the self-supervised method SlotCon is used to train the backbone network ResNet101, and the trained weights serve as the initial backbone weights of the small sample image target detection model. The pictures are then fed into the small sample image target detection network model for training and inference. The pictures enter the network model in batches; using the deep learning framework PyTorch, the input pictures are converted from numpy format to tensor format. The feature maps output by the backbone, composed of the residual network structure and the SENet-based multi-scale attention mechanism, enter the gradient decoupling layer GDL; the resulting outputs are passed to the RPN and the region-of-interest pooling structure RoIPooling respectively. The feature map obtained after RoIPooling is output through the context semantic fusion module (CSF) to the classifier and the frame regressor; the classifier yields the final predicted category score and the frame regressor the final predicted coordinates.
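The numpy-to-tensor conversion mentioned above typically looks like this in PyTorch; the image size here is only an example:

```python
import numpy as np
import torch

# example aerial image as an HxWxC uint8 numpy array
img = np.zeros((224, 224, 3), dtype=np.uint8)

# convert to a float tensor in NCHW layout, scaled to [0, 1]
t = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
```

The permute moves channels first and unsqueeze adds the batch dimension, matching the layout that convolutional backbones expect.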
3. A small sample defect target detection experiment is carried out based on the self-supervision pre-trained small sample target detection model to obtain the experimental results. The mAP50 results for the 6 novel classes of the unmanned aerial vehicle dataset are shown in Table 5.
Evidently, the small sample defect detection model based on self-supervised pre-training proposed by the invention detects the novel classes better than the other SOTA methods. Among all compared methods, the invention gives the best detection results for high-speed rail infrastructure defects on all settings except 10-shot (27.4%, 30.5%, 33.6% and 34.0% mAP50 on 1, 2, 3 and 5 shots, respectively).
Finally, it should be noted that the examples are disclosed for the purpose of aiding in the further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (5)

1. A method for detecting small sample image defect targets based on self-supervision pre-training, characterized in that the transfer learning paradigm of small sample target detection is adopted, namely pre-training on base-class samples and fine-tuning on novel-class samples; the method comprises the following steps:
1) Performing self-supervision pre-training to construct a small sample target detection network model; training a backbone network in the small sample target detection network model by adopting a self-supervision pre-training method to obtain the weight of the backbone network after self-supervision pre-training, wherein the weight is used as the initial weight of the backbone network of the small sample target detection network model;
the constructed small sample target detection network model comprises a main network, a gradient decoupling layer (GDL layer), a region generation network (RPN), a region of interest Pooling structure (RoI Pooling), a context semantic fusion module (CSF), a classifier and a frame regressor;
2) Training by adopting the transfer learning paradigm of small sample target detection: firstly pre-training with a large number of base-class normal picture samples, namely base-class pre-training; then fine-tuning the trained network model on a small number of novel-class defect picture samples to continue training; finally testing on the test picture set to detect the novel classes, namely the defect classes; the same network model structure is used in both the pre-training and fine-tuning stages; the specific process comprises the following steps:
feeding an image into the small sample target detection network model, where the backbone network extracts image features and consists of the residual network ResNet101 and the feature pyramid network FPN; a SENet attention mechanism is added to the last layer of the residual network to form a SENet-based multi-scale attention mechanism, SE-MAM;
the feature map output by the backbone network is passed through the GDL layer in forward propagation, and its output is fed into the region proposal network RPN and the region-of-interest pooling structure RoI Pooling; the RPN provides regression boxes that may contain targets and generates proposal feature vectors with objectness scores and bounding box regression offsets; RoI Pooling applies a pooling method to obtain fixed-size output feature maps from regions of interest of different sizes in the input feature map;
3) The output feature map obtained after RoI Pooling is passed through the context semantic fusion module to the classifier and the bounding box regressor; the classifier computes the probability that the object in a candidate box belongs to each class and outputs the class with the highest probability as the predicted class; the bounding box regressor computes the loss of the offset between the predicted and true values using a loss function, and corrects the candidate box by this offset to obtain the predicted box coordinates, i.e., the position of the detected target;
small sample target detection is then realized using the trained small sample target detection network model.
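As an illustration of the fixed-size pooling step described in claim 1, the following is a minimal, hypothetical numpy sketch of RoI max-pooling (the function name, bin scheme, and shapes are assumptions, not the patent's actual implementation):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(2, 2)):
    """Minimal RoI max-pooling sketch: crop a region of interest from a
    2-D feature map and pool it to a fixed output size."""
    x0, y0, x1, y1 = roi                       # region in (x0, y0, x1, y1) form
    region = feature_map[y0:y1, x0:x1]
    h_bins, w_bins = out_size
    # Split the region into a fixed grid of bins and take the max of each bin
    ys = np.linspace(0, region.shape[0], h_bins + 1).astype(int)
    xs = np.linspace(0, region.shape[1], w_bins + 1).astype(int)
    out = np.zeros(out_size)
    for i in range(h_bins):
        for j in range(w_bins):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Regardless of the RoI's size, the output is always `out_size`, which is what lets proposals of different scales feed the same fixed-dimension classifier and box regressor.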
2. The method for detecting the small sample image defect target based on self-supervision pre-training according to claim 1, characterized in that the self-supervised pre-training method SlotCon is used to train the backbone network ResNet101, performing contrastive learning from data-driven semantic slots for joint semantic grouping and representation learning; semantic grouping is achieved by assigning pixels to a set of learnable prototypes, which adapt to each sample by concentrating the features assigned to them and thereby form new slots; based on the learned data-dependent slots, a contrastive objective is used for representation learning, enhancing the discriminability of the features.
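The semantic grouping step of claim 2 can be sketched in a simplified form: pixels are softly assigned to learnable prototypes by similarity, and each prototype pools its assigned features into a slot. This is an illustrative sketch only (temperature `tau` and the normalization are assumptions; SlotCon's full method also involves teacher/student views and a contrastive loss not shown here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_grouping(pixels, prototypes, tau=0.1):
    """Assign N pixel features (N x D) to K learnable prototypes (K x D),
    then pool the features into K per-sample slots."""
    logits = pixels @ prototypes.T / tau        # N x K pixel-prototype similarities
    assign = softmax(logits, axis=1)            # soft assignment of each pixel
    slots = assign.T @ pixels                   # K x D: features concentrated per slot
    slots /= assign.sum(axis=0, keepdims=True).T + 1e-8   # normalize by assigned mass
    return assign, slots
```

In training, the prototypes are learnable parameters updated by backpropagation, so the slots become data-dependent groupings rather than fixed clusters.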
3. The method for detecting the small sample image defect target based on self-supervision pre-training according to claim 2, characterized in that the multi-scale attention mechanism comprises the feature pyramid FPN and the SENet attention mechanism; the feature map output by the multi-scale attention network enters the gradient decoupling layer GDL, which adjusts the degree of decoupling between different modules;
in forward propagation, an affine transformation layer A is adopted to enhance the feature representation; in backward propagation, the GDL receives gradients from the subsequent layer, multiplies them by a coefficient λ ∈ [0,1], and passes them to the preceding layer;
the GDL is defined as a pseudo function G_(A,λ), whose forward and backward propagation behavior is described by two equations:

G_(A,λ)(x) = A(x)   (1)

∂G_(A,λ)(x)/∂x = λ·J_A   (2)

where J_A is the Jacobian matrix of the affine transformation layer; x is the input feature map; A(x) denotes feeding the feature map into the affine transformation layer.
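Equations (1) and (2) can be illustrated with a minimal numpy sketch in which the affine layer is a plain matrix A, so its Jacobian is A itself (the function names and this simplification are assumptions for illustration; in the network, A would be a learned layer):

```python
import numpy as np

def gdl_forward(x, A):
    # Equation (1): the forward pass simply applies the affine layer A
    return A @ x

def gdl_backward(grad_out, A, lam):
    # Equation (2): the backward pass propagates the gradient through the
    # Jacobian of A (here A itself) scaled by the coefficient lambda in [0, 1]
    return lam * (A.T @ grad_out)
```

With λ = 0 the preceding layers receive no gradient from this branch (full decoupling); with λ = 1 the gradient flows through unchanged.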
4. The method for detecting the small sample image defect target based on the self-supervision pre-training according to claim 1, characterized in that the classifier computes the probability of each class for the object in a candidate box through a softmax function; the bounding box regressor computes the loss of the offset between the predicted and true values using a smooth L1 loss function.
5. The method for detecting the small sample image defect target based on self-supervision pre-training according to claim 1, characterized in that a SENet attention mechanism is added to the last layer of the residual network; SENet models channel interdependencies through a squeeze-and-excitation block to enhance convolutional features; the process comprises the following steps:
assuming the convolution output is U, the statistic z ∈ R^C is generated by shrinking U over its spatial dimensions H×W; the c-th element of z is calculated as:

z_c = F_sq(u_c) = 1/(H×W) Σ_(i=1)^H Σ_(j=1)^W u_c(i,j)   (3)
a gating mechanism with sigmoid activation function is used to capture channel dependencies, expressed as:
s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))   (4)

where δ is the ReLU activation function and σ is the sigmoid activation function;
two fully connected (FC) layers form a bottleneck structure for dimensionality reduction and recovery, respectively; the learned per-channel activations are multiplied with the features of U to obtain the final output, expressed as:

x̃_c = F_scale(u_c, s_c) = s_c · u_c   (5)

where F_scale(u_c, s_c) is the channel-wise product between the scalar s_c and the feature map u_c ∈ R^(H×W).
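Equations (3)-(5) fit in a few lines of numpy. The sketch below assumes the two FC layers are plain weight matrices W1 (reduction) and W2 (recovery) without biases, which is a simplification for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(U, W1, W2):
    """Squeeze-and-Excitation sketch for U of shape (C, H, W):
    squeeze (eq. 3), excite (eq. 4), scale (eq. 5)."""
    z = U.mean(axis=(1, 2))                      # eq. (3): global average pool -> (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))    # eq. (4): FC bottleneck with ReLU, then sigmoid
    return U * s[:, None, None]                  # eq. (5): per-channel rescaling of U
```

W1 has shape (C/r, C) and W2 has shape (C, C/r) for a reduction ratio r, so the gating vector s ∈ (0, 1)^C reweights each channel of U.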
CN202310955804.6A 2023-08-01 2023-08-01 Small sample image defect target detection method based on self-supervision pre-training Pending CN116994047A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310955804.6A CN116994047A (en) 2023-08-01 2023-08-01 Small sample image defect target detection method based on self-supervision pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310955804.6A CN116994047A (en) 2023-08-01 2023-08-01 Small sample image defect target detection method based on self-supervision pre-training

Publications (1)

Publication Number Publication Date
CN116994047A true CN116994047A (en) 2023-11-03

Family

ID=88531526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310955804.6A Pending CN116994047A (en) 2023-08-01 2023-08-01 Small sample image defect target detection method based on self-supervision pre-training

Country Status (1)

Country Link
CN (1) CN116994047A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274748A (en) * 2023-11-16 2023-12-22 国网四川省电力公司电力科学研究院 Lifelong learning power model training and detecting method based on outlier rejection
CN117274748B (en) * 2023-11-16 2024-02-06 国网四川省电力公司电力科学研究院 Lifelong learning power model training and detecting method based on outlier rejection
CN117876763A (en) * 2023-12-27 2024-04-12 广州恒沙云科技有限公司 Coating defect classification method and system based on self-supervision learning strategy
CN117576710A (en) * 2024-01-15 2024-02-20 西湖大学 Method and device for generating natural language text based on graph for big data analysis
CN117576710B (en) * 2024-01-15 2024-05-28 西湖大学 Method and device for generating natural language text based on graph for big data analysis
CN117854072A (en) * 2024-03-07 2024-04-09 德济智能科技(苏州)有限公司 Automatic labeling method for industrial visual defects
CN117854072B (en) * 2024-03-07 2024-05-07 德济智能科技(苏州)有限公司 Automatic labeling method for industrial visual defects
CN117934980A (en) * 2024-03-25 2024-04-26 山东山科数字经济研究院有限公司 Glass container defect detection method and system based on attention supervision adjustment
CN117934980B (en) * 2024-03-25 2024-05-31 山东山科数字经济研究院有限公司 Glass container defect detection method and system based on attention supervision adjustment

Similar Documents

Publication Publication Date Title
CN116994047A (en) Small sample image defect target detection method based on self-supervision pre-training
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN110163836B (en) Excavator detection method used under high-altitude inspection based on deep learning
CN114202672A (en) Small target detection method based on attention mechanism
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN110070025B (en) Monocular image-based three-dimensional target detection system and method
CN111126472A (en) Improved target detection method based on SSD
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN111709416B (en) License plate positioning method, device, system and storage medium
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN110991444A (en) Complex scene-oriented license plate recognition method and device
Huang et al. Recovering compressed images for automatic crack segmentation using generative models
Wu et al. Automatic railroad track components inspection using hybrid deep learning framework
CN114140672A (en) Target detection network system and method applied to multi-sensor data fusion in rainy and snowy weather scene
CN114648669A (en) Motor train unit fault detection method and system based on domain-adaptive binocular parallax calculation
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
Zou et al. Hft: Lifting perspective representations via hybrid feature transformation
Hamaguchi et al. Hierarchical neural memory network for low latency event processing
CN111160100A (en) Lightweight depth model aerial photography vehicle detection method based on sample generation
CN117853955A (en) Unmanned aerial vehicle small target detection method based on improved YOLOv5
CN117115616A (en) Real-time low-illumination image target detection method based on convolutional neural network
Xia et al. Unsupervised optical flow estimation with dynamic timing representation for spike camera
CN116129234A (en) Attention-based 4D millimeter wave radar and vision fusion method
CN116580324A (en) Yolov 5-based unmanned aerial vehicle ground target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination