CN114627292A - Industrial occluded target detection method - Google Patents

Industrial occluded target detection method

Info

Publication number
CN114627292A
CN114627292A (application CN202210227869.4A; granted as CN114627292B)
Authority
CN
China
Prior art keywords
wiggle
carrying
window
feature map
transformer block
Prior art date
Legal status
Granted
Application number
CN202210227869.4A
Other languages
Chinese (zh)
Other versions
CN114627292B (en)
Inventor
王慧燕 (Wang Huiyan)
林文君 (Lin Wenjun)
闫义祥 (Yan Yixiang)
何浩 (He Hao)
Current Assignee
Hangzhou Xiaoli Technology Co ltd
Zhejiang Gongshang University
Original Assignee
Hangzhou Xiaoli Technology Co ltd
Zhejiang Gongshang University
Priority date
Filing date
Publication date
Application filed by Hangzhou Xiaoli Technology Co., Ltd. and Zhejiang Gongshang University
Priority to CN202210227869.4A
Publication of CN114627292A
Application granted
Publication of CN114627292B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

In three consecutive Wiggle Transformer Block operations, different types of modules, namely a W-MSA module, a WWL-MSA module and a WWR-MSA module, are adopted for window segmentation when the window segmentation step of each Wiggle Transformer Block operation is executed, and attention calculation is carried out separately within each window formed after segmentation. Different MSA modules extract different window positions, realizing more cross-window connections, increasing interaction between windows, preserving the necessary interaction of picture-edge information, and improving global context modeling and robustness. Carrying out the attention calculation within each window both introduces the locality of CNN convolution operations and saves computation, meeting the requirements of industrial application. The method can reduce the influence of occluding objects on the detected target and adds detail features at multiple levels, thereby improving the accuracy of occluded-target detection.

Description

Industrial occluded target detection method
Technical Field
The application relates to the technical field of computer vision, in particular to an industrial occluded target detection method.
Background
In recent years, with the rapid development of deep learning, related technologies have been widely applied in the field of target detection. As one of the basic recognition technologies, target detection provides technical support for fields such as intelligent transportation, social security and the industrial internet, and has a wide range of application scenarios.
Occlusion is common in target detection across many fields; occluded-target detection is therefore a difficult problem in the target detection field and one of the most widely discussed issues in the industrial application of target detection methods. Depending on the occluder, occlusion of a target in a real scene falls mainly into two situations. In the first, the target to be detected is occluded by an interfering object, which often causes loss of target information and leads to missed detections. In the second, occlusion occurs between the targets to be detected themselves, which easily introduces a large amount of interference information and causes false detections.
Target detection was dominated by convolutional neural networks for a relatively long time. Although convolutional detection algorithms achieve good results, most methods use only the last layer of features of the convolutional neural network, cannot realize feature diversity across receptive fields and scales, and their computation grows with the number of convolutional layers, making inference time-consuming. In particular, convolution operations are good at extracting local features but have difficulty capturing global representations. Some works have combined Self-Attention with ResNet for the target detection task; compared with the corresponding pure-convolution ResNet architecture, the combination achieves a better trade-off between accuracy and speed. However, the problem of how to accurately fuse local features and global representations remains: the expensive memory accesses of such models cause their actual latency to be significantly greater than that of convolutional networks. The subsequent Vision Transformer has made some progress in image classification, but its structure is not suitable as a general backbone network for dense vision tasks or high-resolution input images. Owing to the matrix nature of an image, at least hundreds of pixels are needed to express the information in one picture, and modeling such long sequences makes the computation of a Transformer large, so a balance between precision, memory access and picture processing speed cannot be achieved.
In summary, for industrial occluded-target detection, the main problems at present are that detection precision is insufficient and detection speed struggles to meet requirements.
Disclosure of Invention
Therefore, an industrial occluded-target detection method needs to be provided to solve the problems that traditional industrial occluded-target detection methods have insufficient detection precision and a detection speed that struggles to meet requirements. The method provided by the application can be widely applied in various fields of industrial detection.
The application provides an industrial occluded target detection method, which comprises the following steps:
inputting a picture to be processed, carrying out patch segmentation on the picture to be processed and applying a linear embedding layer for dimensionality reduction to obtain a first feature map;
carrying out three consecutive Wiggle Transformer Block operations on the first feature map to generate a second feature map;
inputting the second feature map into a Patch Merging module to execute a Restructure operation, and carrying out three consecutive Wiggle Transformer Block operations on the second feature map after the Restructure operation to generate a third feature map;
inputting the third feature map into a Patch Merging module to execute a Restructure operation, and carrying out six consecutive Wiggle Transformer Block operations on the third feature map after the Restructure operation to generate a fourth feature map;
inputting the fourth feature map into a Patch Merging module to execute a Restructure operation, and carrying out three consecutive Wiggle Transformer Block operations on the fourth feature map after the Restructure operation to generate a fifth feature map;
sending the fifth feature map into an RPN network, detecting the occluded target in the fifth feature map through the RPN network, and outputting the occluded-target detection result;
in the three consecutive Wiggle Transformer Block operations, a W-MSA module, a WWL-MSA module and a WWR-MSA module are respectively adopted for window segmentation.
The application relates to an industrial occluded target detection method, which improves the Swin Transformer Block module in a Swin Transformer network: in three consecutive Wiggle Transformer Block operations, a different type of MSA module is adopted each time the window segmentation step of a Wiggle Transformer Block operation is executed. In the three consecutive Wiggle Transformer Block operations, a W-MSA module, a WWL-MSA module and a WWR-MSA module are respectively adopted for window segmentation, and attention calculation is carried out separately within each window formed after segmentation. Different MSA modules extract different window positions, realizing more cross-window connections, increasing interaction between windows, preserving the necessary interaction of picture-edge information, and improving global context modeling and robustness. Carrying out the attention calculation within each window has the advantage that, on the one hand, the locality of CNN convolution operations is introduced and, on the other hand, computation is saved, improving the detection speed of the model to meet the requirements of industrial application. Moreover, through the merging of patches, the resolution can be reduced and the number of channels adjusted to form a hierarchical design, which saves a certain amount of computation. At the same time, the influence of occluding objects on the detected target can be reduced, and detail features can be added at multiple levels, thereby improving the accuracy of occluded-target detection and meeting the high-precision requirements of industrial application.
Drawings
Fig. 1 is a schematic flow chart of the industrial occluded target detection method according to an embodiment of the present application.
Fig. 2 is a network structure diagram of the industrial occluded target detection method according to an embodiment of the present application.
Fig. 3 is a block diagram of the flow of three consecutive Wiggle Transformer Block operations in the industrial occluded target detection method according to an embodiment of the present application.
Fig. 4 is a schematic diagram of window cutting in the industrial occluded target detection method according to an embodiment of the present application.
Fig. 5 is a schematic diagram of the W-MSA module window reorganization mode in the industrial occluded target detection method according to an embodiment of the present application.
Fig. 6 is a schematic diagram of the WWL-MSA module window reorganization mode in the industrial occluded target detection method according to an embodiment of the present application.
Fig. 7 is a schematic diagram of the WWR-MSA module window reorganization mode in the industrial occluded target detection method according to an embodiment of the present application.
Fig. 8 is a schematic diagram of the Patch Merging module in S320 selecting 2 × 2 patch blocks at intervals in the height and width directions of the second feature map for splicing, in the industrial occluded target detection method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides an industrial blocked target detection method. It should be noted that the industrial occluded object detection method provided by the present application is applied to pictures in which mutually occluded objects appear.
In addition, the industrial occluded target detection method provided by the application is not limited to a particular execution subject. Optionally, the execution subject of the method may be an industrial occluded-target detection terminal.
As shown in fig. 1, in an embodiment of the present application, the method for detecting an industrial occlusion target includes the following steps S100 to S600:
S100, inputting a picture to be processed, carrying out patch segmentation on the picture to be processed, and applying a linear embedding layer for dimensionality reduction to obtain a first feature map.
S200, performing three consecutive Wiggle Transformer Block operations on the first feature map to generate a second feature map.
S300, inputting the second feature map into a Patch Merging module to execute a Restructure operation, and performing three consecutive Wiggle Transformer Block operations on the second feature map after the Restructure operation to generate a third feature map.
S400, inputting the third feature map into a Patch Merging module to execute a Restructure operation, and performing six consecutive Wiggle Transformer Block operations on the third feature map after the Restructure operation to generate a fourth feature map.
S500, inputting the fourth feature map into a Patch Merging module to execute a Restructure operation, and performing three consecutive Wiggle Transformer Block operations on the fourth feature map after the Restructure operation to generate a fifth feature map.
S600, the fifth feature map is sent to an RPN network, the RPN network is used for detecting the occlusion target of the fifth feature map, and the occlusion target detection result is output.
In three times of continuous Wiggle Transformer Block operation, a W-MSA module, a WWL-MSA module and a WWR-MSA module are respectively adopted for window segmentation, and attention calculation is respectively carried out in each window formed after segmentation.
Specifically, the picture to be processed is a picture in which mutually-occluded objects appear.
In the three consecutive Wiggle Transformer Block operations, the W-MSA module, the WWL-MSA module and the WWR-MSA module are respectively adopted for window segmentation, and attention calculation is carried out separately within each window formed after segmentation.
As shown in fig. 2, fig. 2 is a network structure diagram of the industrial occluded target detection method according to an embodiment of the present application, which briefly describes the flow of the whole industrial occluded target detection method, that is, briefly describes S100 to S600.
The Image in fig. 2 is the picture to be processed. Stage 1 corresponds to S100 to S200, stage 2 to S300, stage 3 to S400, and stage 4 to S500. The final RPN part is S600: occluded-target detection is performed on the fifth feature map through the RPN network, and the detection results are output. There are two target detection outputs: one is the classification loss value and the other is the bbox_loss (bounding-box regression loss) value.
In this embodiment, the Swin Transformer Block module in a Swin Transformer network is improved: in three consecutive Wiggle Transformer Block operations, a different type of MSA module is used each time the window segmentation step of a Wiggle Transformer Block operation is executed. In the three consecutive Wiggle Transformer Block operations, a W-MSA module, a WWL-MSA module and a WWR-MSA module are respectively adopted for window segmentation. Different MSA modules extract different window positions, realizing more cross-window connections, increasing interaction between windows, preserving the necessary interaction of picture-edge information, and improving global context modeling and robustness. Carrying out the attention calculation within each window, on the one hand, introduces the locality of CNN convolution operations and, on the other hand, saves computation, improving the detection speed of the model to meet the requirements of industrial application. Moreover, through the merging of patches, the resolution can be reduced and the number of channels adjusted to form a hierarchical design, which saves a certain amount of computation. At the same time, the influence of occluding objects on the detected target can be reduced, and detail features can be added at multiple levels, thereby improving the accuracy of occluded-target detection and meeting the high-precision requirements of industrial application.
In addition, three convolution operations are added in series as the last step of each Wiggle Transformer Block operation, and the local features extracted by these convolutions are merged into the Transformer to enhance representation learning, so that the representational capability of both local features and the global representation is better retained.
In an embodiment of the present application, S100 includes the following S110 to S130:
S110, inputting a picture to be processed with height H, width W and 3 channels.
Specifically, if, for example, H is 224 pixels and W is 224 pixels, the input picture to be processed is a 224 × 224 × 3 picture.
S120, dividing the picture to be processed into (H/4) × (W/4) patch blocks, each patch block having a height of 4 pixels and a width of 4 pixels.
Specifically, as shown in fig. 2, the picture to be processed is divided into non-overlapping patch sets by the Patch Partition, and the number of patch blocks is (H/4) × (W/4).
Taking the above example, if H is 224 pixels and W is 224 pixels, the number of divided patch blocks in this step is 56 × 56 = 3136, where each patch has a size of 4 × 4, i.e., a height of 4 pixels and a width of 4 pixels. The patch blocks of identical size can then be output as a 56 × 56 × 48 feature map, referred to in this embodiment as the original feature map. Here 48 is the feature dimension of the original feature map; the map can be viewed as a structure produced by stacking a number of two-dimensional slices (i.e., patch blocks), and the feature dimension is 4 × 4 × 3 = 48.
S130, inputting the patch blocks of identical size into a linear embedding layer as an original feature map, changing the feature dimension of the original feature map into a preset dimension C, converting the 2-dimensional original feature map into a 1-dimensional patch sequence, and taking the converted 1-dimensional patch sequence as the first feature map.
Specifically, continuing the above example, the 56 × 56 × 48 original feature map is input into the linear embedding layer (English name: Linear Embedding). The linear embedding layer maps the feature dimension of the 56 × 56 × 48 original feature map to the preset dimension C. The specific principle is to flatten the H and W dimensions of the 2-dimensional original feature map, so that it is converted into a 1-dimensional patch sequence, generating the final feature structure (or feature vector) as the first feature map.
It will be appreciated that the 56 × 56 × 48 original feature map is thus converted into a first feature map with feature dimension C = 96, i.e., 56 × 56 × 96.
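To make the patch partition and linear embedding concrete, the following is a minimal sketch in PyTorch, assuming a strided-convolution realization of the two steps; the class and variable names are illustrative and not taken from the patent:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Splits an H x W x 3 image into 4x4 patches and projects each
    48-dimensional patch (4*4*3) to the preset dimension C (e.g. 96)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided convolution is equivalent to cutting non-overlapping
        # 4x4 patches and applying a shared linear embedding layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 96, 56, 56)
        x = x.flatten(2).transpose(1, 2)     # (B, 3136, 96): 1-D patch sequence
        return x

img = torch.randn(1, 3, 224, 224)
print(PatchEmbed()(img).shape)               # torch.Size([1, 3136, 96])
```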
In an embodiment of the present application, the S200 includes the following S210 to S230:
S210, inputting the first feature map into a Wiggle Transformer Block, and performing the first Wiggle Transformer Block operation. During the first Wiggle Transformer Block operation, a W-MSA module is used to perform window segmentation to generate a plurality of first windows, and attention calculation is carried out separately within each first window formed after segmentation.
S220, inputting the first feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the second Wiggle Transformer Block operation. During the second Wiggle Transformer Block operation, a WWL-MSA module is used to perform window segmentation to generate a plurality of second windows, and attention calculation is carried out within each second window formed after segmentation.
S230, inputting the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block, performing the third Wiggle Transformer Block operation, and taking the finally obtained feature map as the second feature map. During the third Wiggle Transformer Block operation, a WWR-MSA module is used to perform window segmentation to generate a plurality of third windows, and attention calculation is carried out within each third window formed after segmentation.
Specifically, S210 to S230 comprise three consecutive Wiggle Transformer Block operations. As shown in FIG. 3, FIG. 3 is a block diagram of the flow of the three consecutive Wiggle Transformer Block operations. The window segmentation module used in each Wiggle Transformer Block operation is different, meaning the window segmentation mode is different. The W-MSA module is a window-based multi-head self-attention module with no window-to-window interaction. The WWL-MSA module is a window-based multi-head self-attention module focusing on left-side window interaction. The WWR-MSA module is a window-based multi-head self-attention module focusing on right-side window interaction. Their respective points of emphasis differ.
Specifically, the window generated by window splitting through the W-MSA module is named as a first window. And a window generated by window segmentation by using a WWL-MSA module is named as a second window. And a window generated by window segmentation by using a WWR-MSA module is named as a third window. Hereinafter, the respective meanings of "first window", "second window", and "third window" will be explained with reference to the drawings, and will not be explained repeatedly.
S210, S220 and S230 each comprise two steps: one is window segmentation, and the other is attention calculation within the windows.
Window segmentation is itself divided into two steps: the first is window cutting, and the second is window reorganization.
Fig. 4 is a schematic diagram of window cutting in the industrial occluded target detection method according to an embodiment of the present application; it gives an example of window cutting and shows what a window is. In fig. 4 the minimum unit is the smallest grid cell, and a patch block framed by a dotted-line range of 4 × 4 minimum units is one window; it can be seen that fig. 4(a) has 3 × 3 = 9 windows.
Window cutting can follow different cutting patterns. In fig. 4(a), a new window is formed by taking a quarter of the window size around the center point of every 2 × 2 neighbouring windows; repeating this 4 times gives window 1, window 2, window 3 and window 4. In fig. 4(b), along the two outermost rows in the horizontal direction, a new window is formed by taking half the window size between every two adjacent windows; repeating this 4 times gives window 5, window 6, window 9 and window 11. In fig. 4(c), along the two outermost columns in the vertical direction, half the window size is taken between every two adjacent window blocks to form a new window; repeating this 4 times gives window 7, window 8, window 11 and window 12.
The three different window reorganization modes of the W-MSA module, the WWL-MSA module and the WWR-MSA module are shown in figures 5 to 7.
The window reorganization of the W-MSA module is shown in FIG. 5; its main characteristic is that there is no window-to-window interaction. Continuing the above embodiment, the 9 windows obtained by window cutting in fig. 4 are randomly selected and arranged into a 3 × 3 matrix, and the attention calculation is then performed within each of the 9 windows.
The window reorganization mode of the WWL-MSA module is shown in FIG. 6 and mainly focuses on interaction with the left-side window. Continuing the above embodiment, the 9 windows obtained by window cutting in fig. 4 are randomly selected and arranged into a 3 × 3 matrix, but window 1, window 2, window 3 and window 4 are placed in the lower-right corner, with the other windows enclosing them on their upper-left sides, which indicates that interaction with the left-side window is emphasized. The attention calculation is then performed within each of the 9 windows.
The window reorganization mode of the WWR-MSA module is shown in FIG. 7 and mainly focuses on interaction with the right-side window. The windows obtained by window cutting in fig. 4 are randomly selected and arranged into a 3 × 3 matrix, but window 1, window 2, window 3 and window 4 are placed in the upper-left corner, with the other windows enclosing them on their lower-right sides, which indicates that interaction with the right-side window is emphasized. The attention calculation is then performed within each of the 9 windows.
In this embodiment, in the three consecutive Wiggle Transformer Block operations, a W-MSA module, a WWL-MSA module and a WWR-MSA module are respectively adopted to perform window segmentation. Different MSA modules extract different window positions, realizing more cross-window connections, increasing interaction between windows, and improving global context modeling and robustness.
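To illustrate the window shift that precedes partitioning, here is a rough sketch assuming a cyclic-shift realization in the spirit of Swin Transformer; the shift directions for WWL-MSA and WWR-MSA are an assumption inferred from the left/right emphasis described above, and all names are illustrative:

```python
import torch

def wiggle_shift(x, window_size=7, mode="W"):
    """Shift a (B, H, W, C) feature map before window partitioning.
    mode 'W'  : no shift (no cross-window interaction),
    mode 'WWL': shift toward the upper-left so each window absorbs
                elements of its left/upper neighbours (assumption),
    mode 'WWR': shift toward the lower-right for right-side interaction."""
    s = window_size // 2                     # matches the M/2 mask size
    if mode == "WWL":
        return torch.roll(x, shifts=(-s, -s), dims=(1, 2))
    if mode == "WWR":
        return torch.roll(x, shifts=(s, s), dims=(1, 2))
    return x                                 # plain W-MSA

def window_partition(x, window_size=7):
    """Cut (B, H, W, C) into (num_windows*B, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

feat = torch.randn(1, 56, 56, 96)
windows = window_partition(wiggle_shift(feat, mode="WWL"))
print(windows.shape)                         # torch.Size([64, 7, 7, 96])
```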
As shown in fig. 3, in an embodiment of the present application, the S210 includes the following S211 to S217:
S211, inputting the first feature map into the Wiggle Transformer Block, and carrying out the Layer Normalization operation.
Specifically, S211 to S217 describe the procedure of the first Wiggle Transformer Block operation, which is the first column of operations in fig. 3. First, in S211, a Layer Normalization (LN) operation is performed on the first feature map. For convenience of description, the first feature map is denoted Z_{x-1}.
The role of the Layer Normalization operation is to normalize Z_{x-1}.
S212, carrying out W-MSA module window segmentation on the feature map after the Layer Normalization operation to generate a plurality of first windows, and carrying out attention calculation within each first window formed after segmentation, generating the feature map after W-MSA module window segmentation and attention calculation.
Specifically, the feature map after the Layer Normalization operation is the first feature map after the Layer Normalization operation in S211.
The window splitting mode of the W-MSA module is shown in FIG. 5.
When the W-MSA module performs window segmentation, the number of heads of the Multi-Head Self-Attention is 3; K, Q and V are each a Tensor of dimension (56, 56, 96); and the window size is 7 × 7. (Because of canvas-size limitations, the size of each window illustrated in FIG. 4 is 4 × 4 rather than 7 × 7.) Continuing the above embodiment, the number of windows is 56/7 × 56/7 = 8 × 8 = 64.
In the feature map obtained by window segmentation, each 7 × 7 window performs Self-Attention (i.e., the aforementioned attention calculation) within itself, and a relative position code B is added to the Q·K term of the original attention formula, further improving model performance.
The attention calculation formula of the present application is as follows:

Attention(Q, K, V) = SoftMax(QK^T / √d + B) V

wherein Q is the Query vector, K is the Key vector, V is the Value vector, and d is the scaling parameter that normalizes the scores. B is the relative position code newly added in this application and is a learnable parameter.
After Self-Attention has been performed within each 7 × 7 window, the complete window segmentation process is finished, and the generated feature map is denoted Ẑ_x. It is obtained from the first feature map by first carrying out the Layer Normalization operation and then carrying out W-MSA module window segmentation.
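A minimal sketch of the per-window attention with the learnable relative position code B, assuming a simplified single-head form in PyTorch (a real multi-head module would also split the channels into heads; all names are illustrative):

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-Attention inside one M x M window with a learnable
    relative position bias B, as in the formula above."""
    def __init__(self, dim=96, window_size=7):
        super().__init__()
        self.scale = dim ** -0.5                       # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, dim * 3)
        # one learnable bias per relative (dy, dx) offset pair
        self.B = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2))
        # precompute which bias index each (query, key) pair uses
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size),
            indexing="ij")).flatten(1)                 # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]  # (2, M*M, M*M)
        rel = rel + window_size - 1                    # shift to >= 0
        idx = rel[0] * (2 * window_size - 1) + rel[1]
        self.register_buffer("idx", idx)

    def forward(self, x):                              # x: (nW, M*M, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # QK^T / sqrt(d)
        attn = attn + self.B[self.idx]                 # add position code B
        return attn.softmax(dim=-1) @ v

x = torch.randn(64, 49, 96)                            # 64 windows of 7x7 tokens
print(WindowAttention()(x).shape)                      # torch.Size([64, 49, 96])
```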
S213, residual connection is performed between the feature map after W-MSA module window segmentation and attention calculation and the first feature map.
Specifically, this step performs residual connection between Ẑ_x and Z_{x-1}; the residual connection serves to relate the feature map obtained after W-MSA module window segmentation to the feature map before the segmentation.
S214, carrying out a Layer Normalization operation on the feature map obtained after the residual connection.
Specifically, the feature map obtained after the residual connection is again subjected to an LN operation.
S215, inputting the characteristic diagram after the Layer Normalization operation into a 2-Layer MLP neural network module for neural network processing.
Specifically, the "signature after Layer Normalization operation" in this step is the output result of S214. In this step, the output result of S214 is input to a 2-layer MLP neural network module for neural network processing.
The 2-layer MLP neural network module is a 2-layer multilayer perceptron.
S216, convolving the feature map processed by the neural network three times.
Specifically, the feature map processed by the neural network is input into a 3-layer convolution structure and convolved three times. The convolution kernels of the three convolutions are 3 × 3, 5 × 5 and 1 × 1, respectively. Convolution makes the feature map focus more on the extraction of local detail features.
S217, residual connection is performed between the feature map after the three convolutions and the feature map obtained after the previous residual connection, finally obtaining the first feature map after the first Wiggle Transformer Block operation.
Specifically, this residual connection corresponds to connecting the output result of S213 with the output result of S216, finally obtaining the first feature map after the first Wiggle Transformer Block operation, denoted Z_x.
In this embodiment, a 3-layer convolution module is added after the MLP module of each block, so that while global features are emphasized, the situation in which a target is obscured by an occluding object and blends into the background is overcome, and while local features are emphasized, interaction of global features within a larger receptive field is realized, improving the accuracy of occluded-target detection. The method fully considers characteristics such as the shift invariance of CNNs and the relationship between receptive field and hierarchy, and effectively combines the respective advantages of CNNs and Transformers.
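Putting S211 to S217 together, the following is a schematic sketch of one Wiggle Transformer Block forward pass, assuming PyTorch and treating the window-based attention as a pluggable component; the stand-in `msa` can be any of W-MSA, WWL-MSA or WWR-MSA, and details such as the MLP width are assumptions:

```python
import torch
import torch.nn as nn

class WiggleBlock(nn.Module):
    """LN -> window MSA -> residual -> LN -> 2-layer MLP
       -> 3x3 / 5x5 / 1x1 convolutions -> residual (S211-S217)."""
    def __init__(self, dim=96, msa=None):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = msa or (lambda t: t)         # stand-in for W/WWL/WWR-MSA
        self.mlp = nn.Sequential(               # 2-layer MLP (S215)
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.convs = nn.Sequential(             # three convolutions (S216)
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.Conv2d(dim, dim, 5, padding=2),
            nn.Conv2d(dim, dim, 1))

    def forward(self, z, H=56, W=56):           # z: (B, H*W, C)
        y = z + self.msa(self.ln1(z))           # S211-S213: LN, MSA, residual
        h = self.mlp(self.ln2(y))               # S214-S215: LN, 2-layer MLP
        B, _, C = h.shape                       # S216: reshape and convolve
        h = h.transpose(1, 2).reshape(B, C, H, W)
        h = self.convs(h).reshape(B, C, -1).transpose(1, 2)
        return y + h                            # S217: second residual

z = torch.randn(1, 56 * 56, 96)
print(WiggleBlock()(z).shape)                   # torch.Size([1, 3136, 96])
```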
As shown in fig. 3, in an embodiment of the present application, the step S220 includes the following steps S221 to S227:
S221, inputting the first feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the Layer Normalization operation.
S222, carrying out WWL-MSA module window segmentation on the feature map subjected to the Layer Normalization operation to generate a plurality of second windows, and carrying out attention calculation on each second window formed after segmentation to generate the feature map subjected to WWL-MSA module window segmentation and attention calculation.
S223, residual connection is performed between the feature map after WWL-MSA module window segmentation and attention calculation and the first feature map after the first Wiggle Transformer Block operation.
S224, carrying out a Layer Normalization operation on the feature map obtained after the residual connection.
And S225, inputting the characteristic diagram after the Layer Normalization operation into a 2-Layer MLP neural network module for neural network processing.
And S226, performing convolution for three times on the feature map processed by the neural network.
S227, residual connection is performed between the feature map after the three convolutions and the feature map obtained after the previous residual connection, finally obtaining the first feature map after the second Wiggle Transformer Block operation. The convolution kernels of the three convolutions are 3 × 3, 5 × 5 and 1 × 1, respectively.
Specifically, S221 to S227 are the operations in the second column of fig. 3. Their working principle is largely consistent with that of S211 to S217 and is not repeated here; the difference lies in the module adopted for window segmentation, that is, the segmentation form (in fact, the window segmentation mode) differs. S222 adopts the WWL-MSA module.
The WWL-MSA module window splitting manner is shown in fig. 6.
The WWL-MSA module realizes interaction across windows: the window after the mask contains elements of the originally adjacent windows, with emphasis on communication between left-side windows.
After S221 to S227 are executed on Z_x, the first feature map after the second Wiggle Transformer Block operation is generated, denoted Z_{x+1}.
In an embodiment of the present application, the S230 includes the following S231 to S237:
S231, inputting the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the Layer Normalization operation.
S232, carrying out WWR-MSA module window segmentation on the feature map after the Layer Normalization operation to generate a plurality of third windows, and carrying out attention calculation within each third window formed after segmentation, generating the feature map after WWR-MSA module window segmentation and attention calculation.
S233, residual connection is performed between the feature map after WWR-MSA module window segmentation and attention calculation and the first feature map after the second Wiggle Transformer Block operation.
S234, carrying out a Layer Normalization operation on the feature map obtained after the residual connection.
S235, inputting the characteristic diagram after the Layer Normalization operation into a 2-Layer MLP neural network module for neural network processing.
And S236, performing convolution on the feature map processed by the neural network for three times.
S237, residual connection is performed between the feature map after the three convolutions and the feature map obtained after the previous residual connection, finally obtaining the second feature map. The convolution kernels of the three convolutions are 3 × 3, 5 × 5 and 1 × 1, respectively.
Specifically, S231 to S237 are the operations in the third column of fig. 3. Their working principle is largely consistent with that of S211 to S217 and S221 to S227 and is not repeated here; the difference lies in the module adopted for window segmentation, that is, the segmentation form differs. S232 adopts the WWR-MSA module. On the basis of the W-MSA module's division, the WWR-MSA module applies masks at different positions of the feature map to obtain "shifted" windows; the mask size is M/2, and the number of windows remains unchanged.
The WWR-MSA module window splitting manner is shown in fig. 7.
The WWR-MSA module realizes interaction across windows: the window after the mask contains elements of the originally adjacent windows, with emphasis on communication between right-side windows.
After S231 to S237 are executed on Z_{x+1}, the second feature map is generated, denoted Z_{x+2}.
In summary, after the three consecutive Wiggle Transformer Block operations are performed, Z_{x-1} has become Z_{x+2}, while the resolution remains H/4 × W/4 (i.e., 56 × 56 when H and W are 224).
The subsequent S300, S400 and S500 all contain the same step as S200, namely "perform consecutive Wiggle Transformer Block operations" (three in S300 and S500, six in S400), and their working principles are completely consistent. However, before the consecutive Wiggle Transformer Block operations are performed, S300, S400 and S500 all first need to perform a "Restructure operation"; the procedure of the Restructure operation is explained in detail in S310 to S330.
In an embodiment of the present application, the S300 includes the following S310 to S330:
and S310, inputting the second feature map into a Patch gathering module.
And S320, selecting 2 multiplied by 2Patch blocks at intervals in the height direction and the width direction of the second feature map for splicing through the Patch Merging module.
S330, performing convolution with convolution kernel of 1 × 1 on the spliced second feature map, and finally generating the second feature map subjected to the reconfiguration operation.
Specifically, the "Restructure operation" includes two steps: the first is the splicing step of S320, and the second is the convolution step of S330. In terms of processing principle, the Restructure operation is effectively a down-sampling process.
The splicing process of S320 is shown in fig. 8 and can be understood as dividing the patch blocks of the second feature map into groups, as shown in fig. 8(a): the upper-left corner of fig. 8(a) is one group of patch blocks, and fig. 8(a) has 4 × 4 = 16 groups in total. Each group includes 4 adjacent patch blocks arranged in a 2 × 2 matrix. The 4 patch blocks of each group are then split into an upper-left, an upper-right, a lower-left and a lower-right patch block. The upper-left patch blocks are taken out of every group and spliced together to generate the patch block of fig. 8(b); the upper-right patch blocks are spliced to generate fig. 8(c); the lower-left patch blocks to generate fig. 8(d); and the lower-right patch blocks to generate fig. 8(e). Finally, the patch blocks of figs. 8(b), 8(c), 8(d) and 8(e) are stacked to generate the spliced second feature map, after which the following step S330 is performed.
Continuing the above embodiment, splicing Z_{x+2} gives a feature map with H of 28, W of 28 and a feature dimension of 384. Through the convolution, the feature dimension is reduced from 384 to 192.
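A minimal sketch of the Restructure (Patch Merging) operation described above, assuming channel-last tensors in PyTorch: interleaved 2 × 2 selection, concatenation of the four sub-maps, and a 1 × 1 convolution that halves the concatenated dimension; the names are illustrative:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Splices each 2x2 group of neighbouring patches (fig. 8) and
    reduces the concatenated dimension 4C back to 2C with a 1x1 conv."""
    def __init__(self, dim=96):
        super().__init__()
        self.reduce = nn.Conv2d(4 * dim, 2 * dim, kernel_size=1)

    def forward(self, x):                  # x: (B, H, W, C)
        tl = x[:, 0::2, 0::2, :]           # upper-left patch of every group
        tr = x[:, 0::2, 1::2, :]           # upper-right
        bl = x[:, 1::2, 0::2, :]           # lower-left
        br = x[:, 1::2, 1::2, :]           # lower-right
        x = torch.cat([tl, tr, bl, br], dim=-1)    # (B, H/2, W/2, 4C)
        x = x.permute(0, 3, 1, 2)                  # to channel-first
        return self.reduce(x).permute(0, 2, 3, 1)  # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)
print(PatchMerging()(x).shape)             # torch.Size([1, 28, 28, 192])
```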
After S310 to S330 are executed, the second feature map after the Restructure operation is obtained. Three consecutive Wiggle Transformer Block operations are then performed on it (the principle is consistent with S210 to S230). Here the number of heads of the Multi-Head Attention is 6, and the Tensor dimension of Q, K and V is (28, 28, 192). In this case, the number of windows is 28/7 × 28/7 = 4 × 4 = 16. The finally output third feature map is (28, 28, 192), i.e., its resolution is kept at H/8 × W/8.
In an embodiment of the present application, the S400 includes the following S410 to S470:
S410, inputting the third feature map into the Patch Merging module to execute the Restructure operation.
Specifically, in the same down-sampling manner as S310 to S330, each group of 2 × 2 adjacent patch blocks in the third feature map is first spliced, and a convolution with a 1 × 1 kernel is then performed.
After each group of 2 × 2 adjacent patch blocks in the third feature map is spliced, the feature map becomes (14, 14, 768); through the convolution with a 1 × 1 kernel, the feature dimension is reduced from 768 to 384.
S420, inputting the third feature map after the Restructure operation into a Wiggle Transformer Block, and performing the first Wiggle Transformer Block operation. During the first Wiggle Transformer Block operation, a W-MSA module is used to perform window segmentation to generate a plurality of first windows, and attention calculation is carried out within each first window formed after segmentation.
S430, inputting the third feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the second Wiggle Transformer Block operation. During the second Wiggle Transformer Block operation, a WWL-MSA module is used to perform window segmentation to generate a plurality of second windows, and attention calculation is carried out within each second window formed after segmentation.
S440, inputting the third feature map subjected to the second Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the third Wiggle Transformer Block operation. And in the process of carrying out third Wiggle Transformer Block operation, carrying out window segmentation by using a WWR-MSA module to generate a plurality of third windows, and carrying out attention calculation in each segmented third window.
S450, inputting the third feature map after the third Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the fourth Wiggle Transformer Block operation. During the fourth Wiggle Transformer Block operation, a W-MSA module is used to perform window segmentation to generate a plurality of first windows, and attention calculation is carried out within each first window formed after segmentation.
S460, inputting the third feature map after the fourth Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the fifth Wiggle Transformer Block operation. During the fifth Wiggle Transformer Block operation, a WWL-MSA module is used to perform window segmentation to generate a plurality of second windows, and attention calculation is carried out within each second window formed after segmentation.
S470, inputting the third feature map after the fifth Wiggle Transformer Block operation into the Wiggle Transformer Block, performing the sixth Wiggle Transformer Block operation, and taking the finally obtained feature map as the fourth feature map. During the sixth Wiggle Transformer Block operation, a WWR-MSA module is used to perform window segmentation to generate a plurality of third windows, and attention calculation is carried out within each third window formed after segmentation.
Specifically, in this embodiment, six consecutive Wiggle Transformer Block operations are performed instead of three, but the principle is consistent with that of three consecutive Wiggle Transformer Block operations: the six operations form two successive rounds, each consisting of three consecutive Wiggle Transformer Block operations.
Here the number of heads of the Multi-Head Attention is 12, and the Tensor dimension of Q, K and V is (14, 14, 384). In this case, the number of windows is 14/7 × 14/7 = 2 × 2 = 4. The resolution of the finally generated fourth feature map is kept at H/16 × W/16.
Further, step S500 is consistent in principle with S300 described above. First, in the Restructure operation, after each group of 2 × 2 adjacent patch blocks in the fourth feature map is spliced, the feature map becomes (7, 7, 1536); through the convolution with a 1 × 1 kernel, the feature dimension is reduced from 1536 to 768.
In the window segmentation of S500, the number of heads of the Multi-Head Self-Attention is 24, and the Tensor dimension of Q, K and V is (7, 7, 768). In this case, the number of windows is 7/7 × 7/7 = 1, i.e., only one window remains. The resolution of the finally generated fifth feature map is kept at H/32 × W/32.
In an embodiment of the present application, the S600 includes the following S610 to S620:
and S610, sending the fifth feature map into the RPN network.
S620, performing binary classification on the fifth feature map through the RPN network to obtain the classification loss value.
Specifically, the classification value represents the element type of the occluded part. For example, if the classification value of a bird is 0, that of a tree is 1, and that of a person is 2, the element type of the occluded part can be determined from the obtained classification values.
In an embodiment of the present application, the S600 further includes the following steps:
S630, performing Bounding Box regression on the fifth feature map through the RPN network to obtain the regression loss value.
Specifically, the regression loss value is likewise a scalar value, measuring the bounding-box regression error.
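The patent does not detail the RPN head itself, so the following is a sketch of a standard RPN formulation that yields the two outputs named above: an objectness score per anchor for the classification branch and four regression deltas per anchor for the bounding-box branch. The layer names and the anchor count are assumptions:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Slides a 3x3 conv over the fifth feature map, then predicts, per
    anchor, a binary objectness score (classification-loss branch) and
    4 box regression deltas (bbox_loss branch)."""
    def __init__(self, in_channels=768, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors, 1)      # objectness
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box deltas

    def forward(self, feat):                 # feat: (B, 768, 7, 7)
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

feat = torch.randn(1, 768, 7, 7)             # fifth feature map (H/32, W/32)
scores, deltas = RPNHead()(feat)
print(scores.shape, deltas.shape)            # (1, 9, 7, 7) (1, 36, 7, 7)
```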
Aiming at the complexity and uncertainty of target occlusion, the invention designs a new network module, adding a WWL-MSA module, a WWR-MSA module and a 3-layer convolution structure to overcome the problem that occluding objects obscure target detection, and increasing interaction between windows to improve global context modeling and robustness. Attention is calculated separately within each window, which has the advantage of, on the one hand, introducing the locality of CNN convolution operations and, on the other hand, saving computation. Moreover, through the merging of patches, the resolution can be reduced and the number of channels adjusted to form a hierarchical design, which saves a certain amount of computation. The method can reduce the influence of occluding objects on the target to be detected and adds detail features at multiple levels, thereby improving the accuracy of occluded-target detection.
The technical features of the above embodiments may be combined arbitrarily, and the order of execution of the method steps is not limited. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the technical features, their combinations are considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. An industrial occluded target detection method, characterized in that the method comprises:
inputting a picture to be processed, carrying out patch segmentation on the picture to be processed and applying a linear embedding layer for dimensionality reduction to obtain a first feature map;
carrying out three consecutive Wiggle Transformer Block operations on the first feature map to generate a second feature map;
inputting the second feature map into a Patch Merging module to execute a Restructure operation, and carrying out three consecutive Wiggle Transformer Block operations on the second feature map after the Restructure operation to generate a third feature map;
inputting the third feature map into a Patch Merging module to execute a Restructure operation, and carrying out six consecutive Wiggle Transformer Block operations on the third feature map after the Restructure operation to generate a fourth feature map;
inputting the fourth feature map into a Patch Merging module to execute a Restructure operation, and carrying out three consecutive Wiggle Transformer Block operations on the fourth feature map after the Restructure operation to generate a fifth feature map;
sending the fifth feature map into an RPN network, carrying out occluded-target detection on the fifth feature map through the RPN network, and outputting the occluded-target detection result;
wherein, in the three consecutive Wiggle Transformer Block operations, a W-MSA module, a WWL-MSA module and a WWR-MSA module are respectively adopted for window segmentation, and attention calculation is carried out separately within each window formed after segmentation.
2. The method according to claim 1, wherein the inputting a to-be-processed picture, performing patch segmentation on the to-be-processed picture, and performing dimensionality reduction by applying a linear embedding layer to obtain a first feature map comprises:
inputting a picture to be processed with the height of H, the width of W and the number of channels of 3;
dividing the picture to be processed into (H/4) × (W/4) patch blocks, each patch block being 4 pixels in height and 4 pixels in width;
inputting the patch blocks of identical size into a linear embedding layer as an original feature map, changing the feature dimension of the original feature map into a preset dimension C, converting the 2-dimensional original feature map into a 1-dimensional patch sequence, and taking the converted 1-dimensional patch sequence as the first feature map.
3. The industrial occluded target detection method of claim 2, wherein performing three consecutive Wiggle Transformer Block operations on the first feature map and generating the second feature map comprises:
inputting the first feature map into a Wiggle Transformer Block, and performing the first Wiggle Transformer Block operation; during the first Wiggle Transformer Block operation, carrying out window segmentation by using a W-MSA module to generate a plurality of first windows, and carrying out attention calculation within each first window formed after segmentation;
inputting the first feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block, and carrying out the second Wiggle Transformer Block operation; during the second Wiggle Transformer Block operation, carrying out window segmentation by using a WWL-MSA module to generate a plurality of second windows, and carrying out attention calculation within each second window formed after segmentation;
inputting the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block, performing the third Wiggle Transformer Block operation, and taking the finally obtained feature map as the second feature map; during the third Wiggle Transformer Block operation, carrying out window segmentation by using a WWR-MSA module to generate a plurality of third windows, and carrying out attention calculation within each third window formed after segmentation.
4. The industrial occluded target detection method of claim 3, wherein the inputting the first feature map into a Wiggle Transformer Block, and performing a first Wiggle Transformer Block operation, comprises:
inputting the first feature map into the Wiggle Transformer Block, and performing the Layer Normalization operation;
carrying out W-MSA module window segmentation on the feature map after the Layer Normalization operation to generate a plurality of first windows, and carrying out attention calculation within each first window formed after segmentation, generating the feature map after W-MSA module window segmentation and attention calculation;
residual error connection is carried out on the characteristic diagram after W-MSA module window segmentation and attention calculation and the first characteristic diagram;
carrying out Layer Normalization operation on the feature diagram obtained after residual errors are connected;
inputting the characteristic diagram after the Layer Normalization operation into a 2-Layer MLP neural network module for neural network processing;
carrying out three times of convolution on the characteristic diagram processed by the neural network;
residual connection is carried out between the feature map after the three convolutions and the feature map obtained after the previous residual connection, finally obtaining the first feature map after the first Wiggle Transformer Block operation; the convolution kernels of the three convolutions are 3 × 3, 5 × 5 and 1 × 1, respectively.
5. The industrial occluded target detection method according to claim 4, wherein inputting the first feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block and performing the second Wiggle Transformer Block operation comprises:
inputting the first feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block, and performing the Layer Normalization operation;
carrying out WWL-MSA module window segmentation on the feature map after the Layer Normalization operation to generate a plurality of second windows, and carrying out attention calculation within each second window formed after segmentation, generating the feature map after WWL-MSA module window segmentation and attention calculation;
carrying out residual error connection on the characteristic diagram after WWL-MSA module window segmentation and attention calculation and the first characteristic diagram after first Wiggle Transformer Block operation;
carrying out Layer Normalization operation on the feature diagram obtained after residual errors are connected;
inputting the characteristic diagram after the Layer Normalization operation into a 2-Layer MLP neural network module for neural network processing;
carrying out three times of convolution on the characteristic diagram processed by the neural network;
residual error connection is carried out on the feature graph obtained after the feature graph subjected to the three-time convolution and the residual error connection again, and finally the first feature graph subjected to the second Wiggle Transformer Block operation is obtained; convolution kernels for the three convolutions are 3 × 3, 5 × 5 and 1 × 1, respectively.
6. The industrial occluded target detection method of claim 5, wherein inputting the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block, performing the third Wiggle Transformer Block operation, and taking the finally obtained feature map as the second feature map comprises:
inputting the first feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block and performing a Layer Normalization operation;
performing window segmentation with the WWR-MSA module on the feature map after the Layer Normalization operation to generate a plurality of third windows, and performing attention calculation within each third window formed after segmentation, generating a feature map after WWR-MSA window segmentation and attention calculation;
performing a residual connection between the feature map after WWR-MSA window segmentation and attention calculation and the first feature map after the second Wiggle Transformer Block operation;
performing a Layer Normalization operation on the feature map obtained after the residual connection;
inputting the feature map after the Layer Normalization operation into a 2-layer MLP neural network module for processing;
applying three successive convolutions to the feature map output by the MLP;
performing a second residual connection between the feature map after the three convolutions and the feature map obtained from the first residual connection, finally obtaining the second feature map; the kernels of the three convolutions are 3 × 3, 5 × 5 and 1 × 1, respectively.
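Claims 5 and 6 repeat the block of claim 4 verbatim, swapping only the window-segmentation module (W-MSA, then WWL-MSA, then WWR-MSA). Building on the two sketches above, a three-operation group could be wired as follows; routing the mode flag into the attention step is an assumption, since the simplified stand-in attention above does not window its input:

import torch.nn as nn

class WiggleGroup(nn.Module):
    # Three consecutive Wiggle Transformer Block operations (claims 4-6).
    def __init__(self, dim: int = 96):
        super().__init__()
        self.modes = ("W", "WWL", "WWR")  # claim 4, claim 5, claim 6
        self.blocks = nn.ModuleList(WiggleTransformerBlock(dim) for _ in self.modes)

    def forward(self, x, h, w):
        for mode, blk in zip(self.modes, self.blocks):
            # a faithful implementation would pass `mode` into the attention so
            # that it partitions with partition_windows(..., mode=mode)
            x = blk(x, h, w)
        return x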
7. The industrial occluded target detection method of claim 6, wherein inputting the second feature map into the Patch Merging module to perform the Restructure operation comprises:
inputting the second feature map into the Patch Merging module;
selecting 2 × 2 patch blocks at intervals along the height and width directions of the second feature map through the Patch Merging module and splicing them to generate a spliced second feature map;
performing a convolution with a 1 × 1 kernel on the spliced second feature map, finally generating the second feature map after the Restructure operation.
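The Restructure operation of claim 7 is, in effect, a space-to-depth downsampling: the four interleaved phases of each 2 × 2 block are stacked on the channel axis and then mixed by a 1 × 1 convolution. A minimal sketch, assuming PyTorch; the output channel count is an assumption, as the claim only recites the 1 × 1 kernel:

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.reduce = nn.Conv2d(4 * dim, 2 * dim, kernel_size=1)  # 1x1 convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); pick every other pixel along H and W, four phases
        tl = x[:, :, 0::2, 0::2]  # top-left of each 2x2 block
        tr = x[:, :, 0::2, 1::2]  # top-right
        bl = x[:, :, 1::2, 0::2]  # bottom-left
        br = x[:, :, 1::2, 1::2]  # bottom-right
        x = torch.cat([tl, tr, bl, br], dim=1)  # splice: (B, 4C, H/2, W/2)
        return self.reduce(x)                   # (B, 2C, H/2, W/2)

print(PatchMerging(dim=96)(torch.randn(1, 96, 56, 56)).shape)  # (1, 192, 28, 28)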
8. The industrial occluded target detection method of claim 7, wherein inputting the third feature map into the Patch Merging module to execute a Restructure operation and performing six consecutive Wiggle Transformer Block operations on the third feature map after the Restructure operation to generate a fourth feature map comprises:
inputting the third feature map into the Patch Merging module to execute the Restructure operation;
inputting the third feature map after the Restructure operation into the Wiggle Transformer Block and performing a first Wiggle Transformer Block operation; during the first Wiggle Transformer Block operation, performing window segmentation with the W-MSA module to generate a plurality of first windows, and performing attention calculation within each first window formed after segmentation;
inputting the third feature map after the first Wiggle Transformer Block operation into the Wiggle Transformer Block and performing a second Wiggle Transformer Block operation; during the second Wiggle Transformer Block operation, performing window segmentation with the WWL-MSA module to generate a plurality of second windows, and performing attention calculation within each second window formed after segmentation;
inputting the third feature map after the second Wiggle Transformer Block operation into the Wiggle Transformer Block and performing a third Wiggle Transformer Block operation; during the third Wiggle Transformer Block operation, performing window segmentation with the WWR-MSA module to generate a plurality of third windows, and performing attention calculation within each third window formed after segmentation;
inputting the third feature map after the third Wiggle Transformer Block operation into the Wiggle Transformer Block and performing the first Wiggle Transformer Block operation a second time (the fourth operation overall); during this operation, performing window segmentation with the W-MSA module to generate a plurality of first windows, and performing attention calculation within each first window formed after segmentation;
inputting the third feature map after the fourth operation into the Wiggle Transformer Block and performing the second Wiggle Transformer Block operation a second time (the fifth operation overall); during this operation, performing window segmentation with the WWL-MSA module to generate a plurality of second windows, and performing attention calculation within each second window formed after segmentation;
inputting the third feature map after the fifth operation into the Wiggle Transformer Block, performing the third Wiggle Transformer Block operation a second time (the sixth operation overall), and taking the finally obtained feature map as the fourth feature map; during this operation, performing window segmentation with the WWR-MSA module to generate a plurality of third windows, and performing attention calculation within each third window formed after segmentation.
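Claim 8 chains six operations whose window modes cycle W-MSA, WWL-MSA, WWR-MSA twice over. With the WiggleGroup sketch above, the six operations reduce to two groups applied back to back (the dimension 192 and the spatial size are illustrative, not recited):

import torch
import torch.nn as nn

groups = nn.ModuleList([WiggleGroup(dim=192), WiggleGroup(dim=192)])

def six_operations(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    # operations 1-3, then 4-6: W -> WWL -> WWR, twice
    for g in groups:
        x = g(x, h, w)
    return x

out = six_operations(torch.randn(1, 28 * 28, 192), h=28, w=28)
print(out.shape)  # torch.Size([1, 784, 192])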
9. The industrial occluded target detection method of claim 8, wherein sending the fifth feature map into the RPN network, performing occluded target detection on the fifth feature map through the RPN network, and outputting an occluded target detection result comprises:
sending the fifth feature map into the RPN network;
performing binary (foreground/background) classification on the fifth feature map through the RPN network to obtain a classification loss value.
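For claim 9's binary (object vs. background) classification in the RPN, a minimal sketch follows, assuming PyTorch; the channel count, anchor count and the random stand-in labels are illustrative, as the claim recites none of them:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNClsHead(nn.Module):
    def __init__(self, in_ch: int = 256, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.cls = nn.Conv2d(in_ch, num_anchors, 1)  # one objectness logit per anchor

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.cls(F.relu(self.conv(feat)))  # (B, A, H, W)

feat = torch.randn(1, 256, 25, 25)                  # fifth feature map (shape assumed)
logits = RPNClsHead()(feat)
labels = torch.randint(0, 2, logits.shape).float()  # stand-in anchor labels
cls_loss = F.binary_cross_entropy_with_logits(logits, labels)  # classification loss value
print(cls_loss.item())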
10. The industrial occluded target detection method of claim 9, wherein sending the fifth feature map into the RPN network, performing occluded target detection on the fifth feature map through the RPN network, and outputting the occluded target detection result further comprises:
performing Bounding Box regression on the fifth feature map through the RPN network to obtain a regression loss value.
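A companion sketch for claim 10: the RPN box-regression branch predicts four offsets (dx, dy, dw, dh) per anchor and is trained with a smooth-L1 loss. Anchor matching is omitted here, and random targets stand in for the encoded ground-truth offsets:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNRegHead(nn.Module):
    def __init__(self, in_ch: int = 256, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.reg = nn.Conv2d(in_ch, num_anchors * 4, 1)  # 4 box offsets per anchor

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.reg(F.relu(self.conv(feat)))  # (B, 4A, H, W)

feat = torch.randn(1, 256, 25, 25)
deltas = RPNRegHead()(feat)
targets = torch.randn_like(deltas)            # stand-in encoded offsets
reg_loss = F.smooth_l1_loss(deltas, targets)  # regression loss value
print(reg_loss.item())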
CN202210227869.4A 2022-03-08 2022-03-08 Industrial shielding target detection method Active CN114627292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210227869.4A CN114627292B (en) 2022-03-08 2022-03-08 Industrial shielding target detection method

Publications (2)

Publication Number Publication Date
CN114627292A true CN114627292A (en) 2022-06-14
CN114627292B CN114627292B (en) 2024-05-14

Family

ID=81900180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210227869.4A Active CN114627292B (en) 2022-03-08 2022-03-08 Industrial shielding target detection method

Country Status (1)

Country Link
CN (1) CN114627292B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200098108A1 (en) * 2018-09-26 2020-03-26 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for image processing
CN111144203A (en) * 2019-11-19 2020-05-12 Zhejiang Gongshang University Pedestrian shielding detection method based on deep learning
CN113657409A (en) * 2021-08-16 2021-11-16 Ping An Technology (Shenzhen) Co., Ltd. Vehicle loss detection method, device, electronic device and storage medium
CN113888744A (en) * 2021-10-14 2022-01-04 Zhejiang University Image semantic segmentation method based on Transformer visual upsampling module
CN114066820A (en) * 2021-10-26 2022-02-18 Wuhan Textile University Fabric defect detection method based on Swin-Transformer and NAS-FPN
CN114022770A (en) * 2021-11-11 2022-02-08 Sun Yat-sen University Mountain crack detection method based on improved self-attention mechanism and transfer learning
CN114078230A (en) * 2021-11-19 2022-02-22 Southwest Jiaotong University Small target detection method for self-adaptive feature fusion redundancy optimization
CN114066902A (en) * 2021-11-22 2022-02-18 Anhui University Medical image segmentation method, system and device based on convolution and transformer fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG HAITAO; ZHANG MENG: "SSD Object Detection Algorithm with Channel Attention Mechanism", Computer Engineering, No. 08, 15 August 2020 (2020-08-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156157A (en) * 2023-04-24 2023-05-23 Changsha Hisense Intelligent *** Research Institute Co., Ltd. Camera shielding abnormality detection method and electronic equipment
CN116156157B (en) * 2023-04-24 2023-08-18 Changsha Hisense Intelligent *** Research Institute Co., Ltd. Camera shielding abnormality detection method and electronic equipment

Also Published As

Publication number Publication date
CN114627292B (en) 2024-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant