CN117058382A - Crack image segmentation method based on double encoders in complex environment - Google Patents

Crack image segmentation method based on double encoders in complex environment

Info

Publication number
CN117058382A
CN117058382A (application CN202311029215.1A)
Authority
CN
China
Prior art keywords: image, layer, encoder, input, convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311029215.1A
Other languages
Chinese (zh)
Inventor
Zhang Jianming (张建明)
Zeng Zhigao (曾志高)
Wang Jin (王进)
Wang Jianxin (王建新)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202311029215.1A
Publication of CN117058382A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a crack image segmentation method based on dual encoders in a complex environment. On the basis of suitably improving the overall framework of the dual-path codec structure of the prior art CN202310525413.0, a new Transformer block structure is designed to effectively extract the global semantic information of the image, and a multi-window high-low frequency mechanism based on the Haar wavelet transform extracts the high-frequency and low-frequency features of the image while enhancing local information perception and the interaction between image blocks, thereby solving the problems of existing crack detection algorithms, which adapt poorly to complex environments and are computationally expensive. In addition, the invention provides a brand-new feature fusion module to better fuse the intermediate features of the two encoders. The method accurately detects cracks in a variety of complex environments, reduces the amount of computation, and improves the environmental versatility of crack detection.

Description

Crack image segmentation method based on double encoders in complex environment
Technical Field
The invention belongs to the field of image processing, and particularly relates to a crack image segmentation method based on double encoders in a complex environment.
Background
Cracks are a significant indicator of structural health during inspection. Roads, bridges and various concrete structures develop cracks with use, one of the most common forms of structural distress. However, because of the variety of complex structures involved and the considerable safety risks, manual crack inspection is time-consuming, labor-intensive and extremely challenging. With the continued development of deep learning, many excellent crack detection and segmentation algorithms have been proposed. Automating crack detection saves considerable manpower and material resources, and removing the subjective factors of human inspection improves accuracy.
Existing crack detection methods fall into two broad categories: those based on traditional digital image processing, and those that build a deep learning model to detect cracks automatically. Traditional digital image processing has developed over many years and its techniques are relatively mature. For example, Salman et al. proposed a crack detection method based on Gabor filtering, and Talab et al. used a Sobel filter to remove noise from the gray image and the OTSU method to complete crack detection. Deep learning approaches typically use a convolutional neural network to progressively extract features of the image at each scale, then restore the image step by step through a decoder to obtain the segmentation mask. For example, Choi et al. proposed SDDNet for segmenting cracks in real time, and Jiang et al. proposed HDCB-Net for segmenting cracks in bridge concrete. In recent years the Transformer has achieved good results across computer vision tasks; it has characteristics that many convolutional neural networks lack, such as input adaptivity, parallel processing of the whole input sequence, and excellent long-range dependency modeling. For example, Wang et al. proposed the crack segmentation network SegCrack, whose encoder is implemented with a Transformer.
Traditional digital image processing methods have the advantages of high processing speed and low hardware requirements, but their robustness and accuracy are low and their results are strongly affected by the environment, so in practice they struggle to adapt to complex environments. Segmentation networks built from convolutional neural networks often stack so many convolutional layers that the receptive field keeps growing; much detail information is lost, and it is difficult for the subsequent decoder to restore fine regions when recovering the image. Since cracks are mostly thin and long, this is a disadvantage for crack segmentation. Transformer-based models tend to be computationally intensive, which brings a heavy computational burden and cost, and excessive attention to global information can also cause local detail to be ignored. In the prior art, the inventor previously filed Chinese patent application CN202310525413.0 (publication No. CN 116563544 A, published 2023.08.08), which discloses a road crack segmentation method based on a convolutional neural network and Transformer dual path. That method effectively improves the accuracy and efficiency of fine-crack segmentation, but it mainly targets road pavement environments; practical application shows that it has difficulty adapting to more complex environments, its computation is large, and its environmental versatility for crack detection is weak.
In order to detect cracks accurately in various complex environments while reducing computation, an improved crack image segmentation method is urgently needed to increase the environmental versatility of crack detection.
Disclosure of Invention
(I) Technical problem
Based on this, the invention improves and designs a crack image segmentation method based on dual encoders in a complex environment. On the basis of suitably improving the overall framework of the dual-path codec structure of the prior art CN202310525413.0, a new Transformer block structure is designed to effectively extract the global semantic information of the image while extracting its high-frequency and low-frequency features, and to enhance local information perception and the interaction between image blocks, thereby solving the poor adaptability to complex environments and the large computation of existing crack detection algorithms and improving the environmental versatility of crack detection. In addition, a brand-new complementary feature fusion module is specifically designed to fuse the intermediate features of the two encoders.
(II) Technical scheme
The invention provides a crack image segmentation method based on dual encoders in a complex environment, which improves parts (a)-(b) of the original overall framework of the road crack segmentation method based on a convolutional neural network and Transformer dual path of patent application CN202310525413.0:
(a) Dual encoder
The dual encoder comprises a convolutional neural network encoder branch and a Transformer encoder branch. In the Transformer encoder branch, the four original Transformer blocks are replaced with the new Transformer blocks. Before being input to the Transformer encoder branch, the input image is first divided into a number of image blocks and flattened into an image sequence; after one linear projection, the sequence is fed into the Transformer encoder branch structure formed by the four Transformer blocks and three patch merging layers, and the Transformer block of each layer in the dual encoder is executed three times in succession;
(b) Decoder
All three Transformer blocks in the decoder are likewise replaced with the new Transformer blocks, and the Transformer block of each layer in the decoder is executed once;
the Transformer block comprises, connected in series, a high-low frequency attention mechanism, a first addition and layer normalization, a locally enhanced feed-forward network, and a second addition and layer normalization;
in the high-low frequency attention mechanism, the standard decomposition of the Haar wavelet transform is adopted first: the rows are filtered, then the columns, and a first-level decomposition yields three detail components and one approximation component. The image can be regarded as a discrete function taking values in [0, 1] and can then be represented as a sum of scale functions multiplied by their coefficients, the scale function being $\phi(x)=1$ for $0\le x<1$ and $\phi(x)=0$ otherwise. The wavelet function $\psi(x)$ can be represented by the scale function, i.e. $\psi(x)=\phi(2x)-\phi(2x-1)$, and the first-order wavelet decomposition is then written $f(x)=\sum_k c_k\,\phi(x-k)+\sum_k d_k\,\psi(x-k)$, where $C_0=\{c_k\}$ is the required approximation component. $C_0$ is then upsampled to the original image size, and the approximation component is subtracted from the original image to obtain the required detail component. For the obtained approximation component and detail component, a set of queries Q, keys K and values V is obtained by linear transformation, respectively, the query of the approximation component being taken from the input; the two different sets of Q, K and V are then processed by scaled dot-product attention to obtain the single-head scaled dot-product self-attention result $\mathrm{SA}_h$, and the results of the several heads are concatenated and linearly transformed once to obtain the multi-head self-attention result. When the feature dimension is $N_h$, the approximation component is allotted $\alpha N_h$ of it and the detail component $(1-\alpha)N_h$; the high-frequency and low-frequency features thus obtained are concatenated in the channel direction as the output, which then passes with the input through the first addition and layer normalization residual structure.
In another aspect, the present invention also discloses a crack image segmentation system based on dual encoders in a complex environment, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the dual-encoder-based crack image segmentation method in a complex environment as described in any of the above.
In another aspect, the invention also discloses a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the dual-encoder-based crack image segmentation method in a complex environment as described in any one of the above.
(III) Beneficial effects
(1) First, the invention improves the overall framework of a dual-encoder-path crack segmentation network model that adapts to a variety of complex environments with good computational performance. The convolutional layers of ResNet-50 are extracted to construct the convolutional neural network encoder branch, which extracts the local detail information of the image and excels at extracting wide, obvious crack information. A Transformer encoder branch is independently designed to extract the global semantic information of the image, improving the capture of tiny cracks. The features extracted by the improved dual encoder are fully fused; the network runs with fewer multiply-accumulate operations and a higher frame rate, so it can be applied in practice more readily.
(2) Second, by designing a brand-new Transformer block structure, the invention provides a plug-and-play component for extracting image information. The Transformer block contains a high-low frequency attention mechanism based on the Haar wavelet transform that extracts the high-frequency and low-frequency features of the image simultaneously. The improved locally enhanced feed-forward network strengthens the interaction between image blocks and the local information perception of the network model.
(3) In addition, the invention designs a brand-new feature fusion module to complementarily fuse the intermediate features of the two encoders. First, the two different features are aligned into the same dimension. Channel attention then adjusts the weight of each channel and suppresses unfavorable, redundant channels. Their correlation is enhanced and cross-domain fusion is performed. Finally, a feature fusion block consisting of a series of convolution units, batch normalization and activation functions completes the aggregation of the several feature types to obtain the fused feature.
(4) Numerous experiments verify the performance of the invention. Compared with ten current advanced networks on the two public crack datasets DeepCrack and Crack3238, the comprehensive performance is better than the other networks. The recall of the model is greatly improved and the detection rate of cracks in complex environments rises markedly; ablation experiments further verify the effectiveness of the proposed method and modules.
Drawings
FIG. 1 is the overall structure and flow diagram of the dual encoder of the invention.
FIG. 2 is the structure and flow diagram of the Transformer block in FIG. 1.
FIG. 3 is the structure and flow diagram of the Haar-transform-based high-low frequency attention mechanism in the Transformer block.
FIG. 4 is the structure and flow diagram of the locally enhanced feed-forward network in the Transformer block.
FIG. 5 is the structure diagram of the complementary feature fusion module in FIG. 1.
FIG. 6 is the structure and flow diagram of the cross-domain fusion block in the feature fusion module.
FIG. 7 shows segmentation examples of the invention on the DeepCrack dataset.
FIG. 8 shows segmentation examples of the invention on the Crack3238 dataset.
FIG. 9 shows examples of feature maps extracted from the intermediate layers of the different modules of the invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In the prior art, CN202310525413.0 discloses a road crack segmentation method based on a convolutional neural network and Transformer dual path. That method effectively improves the accuracy and efficiency of fine-crack segmentation, but through practical application the inventor found that it has difficulty adapting to complex environments and requires substantial computation, so the present invention makes targeted improvements on that basis.
As shown in FIG. 1, the invention again uses a convolutional neural network and a Transformer together to construct a new dual-encoder-path network, the dual encoder comprising a convolutional neural network encoder branch and a Transformer encoder branch. The overall structure of the original dual encoder in CN202310525413.0 is modified with clear distinctions, so that the brand-new Transformer encoder branch extracts the semantic information of the image more comprehensively, helping raise the recall of the network and improving both the detection rate of tiny cracks in complex environments and computational efficiency. The widely used ResNet-50 serves as the convolutional neural network encoder branch of the invention to extract the local detail information of the image; it extracts wide, obvious crack information well in a variety of environments. The whole network downsamples layer by layer from top to bottom, extracting information at different scales. The complementary feature fusion module proposed by the invention is embedded between the layers to aggregate the features extracted by the two encoder branches. The fused feature of the last layer is sent to the decoder for decoding, and the fused features of the other layers are sent to the corresponding decoder layers through skip connections, which aids the recovery of image information.
On the basis of the overall framework of the prior art CN202310525413.0, the dual encoder section and decoder of the invention shown in FIG. 1 are improved in detail as follows:
(a) Convolutional neural network encoder branch
The invention adopts some of the layers of the widely used ResNet-50, extracting the initialization layer, the max-pooling layer and the four convolution layers of the existing ResNet-50 network. The input image first passes through the initialization layer and the max-pooling layer; local features of the image are then extracted as it passes through the four convolution layers, and the intermediate feature of each of convolution layers 1-4 is retained, output to the four feature fusion modules, and fused with the corresponding intermediate feature of the Transformer encoder branch.
It should be noted that in the prior art CN202310525413.0, convolution layer 1 of the convolutional neural network encoder branch does not feed an intermediate feature to a complementary fusion module; the feature from the depthwise separable convolution connected to convolution layer 4 is fed to a simple fusion module (structurally different from the other three complementary fusion modules); and the input data are not preprocessed by an initialization layer and a max-pooling layer.
(b) Transformer encoder branch
Before the input image enters the Transformer encoder branch, it is first divided into a number of image blocks and flattened into an image sequence, and then one linear projection is applied. The structure of the Transformer encoder branch is consistent with the prior art: it also consists of four Transformer blocks and three patch merging layers. That is, after one linear projection the image enters one Transformer block, whose output is fed in parallel with the output of convolution layer 1 into one feature fusion module; then three Transformer encoder layers, each formed by a patch merging layer followed by a Transformer block, are connected in sequence, and their outputs correspond in turn to convolution layers 2-4 and undergo feature fusion through the other three feature fusion modules.
The newly designed Transformer block consists of residual structures formed by the Haar-transform-based high-low frequency attention mechanism and the locally enhanced feed-forward network. Each Transformer block in the encoder is executed three times in succession, i.e. every layer's Transformer block is repeated three times, and the patch merging layer merges every 4 adjacent blocks into one, yielding a feature map of the same scale as the corresponding layer of the convolutional neural network.
(c) Decoder section
For the decoder part, the skip connections of the four feature fusion modules and the main structure of the decoder are consistent with the decoder of the prior art CN202310525413.0: the output of the feature fusion module corresponding to the bottom convolution layer 4 serves as the basic image recovery signal for decoder layer 1, the outputs of the three feature fusion modules corresponding to convolution layers 3-1 are fed in turn into the serially connected decoder layers 1-3, and the result is finally output through the segmentation head connected to decoder layer 3. All three Transformer blocks 5-7 in the original decoder layers are replaced with the Transformer block of the invention; unlike the Transformer block in the encoder, which loops three times, the Transformer block in the decoder is executed only once, and the decoder output section specifically uses a segmentation head consisting of a sub-pixel convolution and a 1 x 1 convolution.
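For orientation, the end-to-end data flow just described can be sketched as follows. This outline is illustrative only, not the original disclosure; every argument is a placeholder callable standing in for a module described in this section.

```python
def dual_encoder_forward(image, cnn_stages, transformer_stages,
                         fusion_modules, decoder_layers, segmentation_head):
    """Illustrative data flow of FIG. 1; all arguments are placeholder callables."""
    cnn_feats = []                              # Cv_1..Cv_4 from the ResNet-50 stages
    x = image
    for stage in cnn_stages:
        x = stage(x)
        cnn_feats.append(x)

    trans_feats = []                            # T_1..T_4 from the Transformer branch
    t = image
    for stage in transformer_stages:
        t = stage(t)
        trans_feats.append(t)

    # One complementary fusion module per scale.
    fused = [f(cv, t) for f, cv, t in zip(fusion_modules, cnn_feats, trans_feats)]

    x = fused[3]                                # bottom fusion feature starts decoding
    for layer, skip in zip(decoder_layers, fused[2::-1]):  # skips from layers 3, 2, 1
        x = layer(x, skip)
    return segmentation_head(x)
```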
In order to detect cracks more accurately and reduce computation in various complex environments, on the basis of the dual encoder and decoder of FIG. 1 the invention specifically provides a crack image segmentation method based on dual encoders in a complex environment, comprising the following steps:
1. data input
Step 1: and loading data, namely cutting an input image and enhancing the data.
Specifically, in step 1, during network training and testing the input image is suitably preprocessed and augmented so that the network segments the image better. In the training stage, the input image is cropped and suitably augmented: the images undergo, in order, resizing, random horizontal flipping, random vertical flipping, random rotation and center cropping. In the test stage, the image is only cropped to the specified input size. The three channels of the image are regularized. The input image can be cropped to 512 x 512 or 256 x 256.
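As an illustration, the preprocessing of step 1 could be assembled with torchvision as in the following sketch; the resize target, rotation range and normalization statistics are assumptions, since the patent does not state them.

```python
import torchvision.transforms as T

# Training-time pipeline from step 1: resize, random horizontal/vertical flips,
# random rotation, center crop, then per-channel normalization.
train_transform = T.Compose([
    T.Resize((544, 544)),                # assumed intermediate size before cropping
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=15),        # rotation range is an assumption
    T.CenterCrop(512),                   # the patent crops to 512x512 or 256x256
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Test-time pipeline: only crop to the specified input size and normalize.
test_transform = T.Compose([
    T.Resize((512, 512)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```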
2. Convolutional neural network encoder branch
Step 2: the convolutional neural network encoder branch encodes an input image.
Specifically, in step 2, some of the existing ResNet-50 layers are used to construct the convolutional neural network encoder branch of the invention. ResNet-50 is deconstructed, extracting the initialization layer, the max-pooling layer and the four convolution layers. The input image first passes through the initialization layer and the max-pooling layer, and local features of the image are extracted as it passes through the four convolution layers. The intermediate feature of each layer is retained and output to a feature fusion module for fusion with the corresponding intermediate feature of the Transformer encoder branch.
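A minimal sketch of this deconstruction, assuming the torchvision ResNet-50 implementation, might read:

```python
import torch.nn as nn
from torchvision.models import resnet50

# Deconstruct ResNet-50 into the pieces named in step 2: an initialization layer
# (conv + BN + ReLU), the max-pooling layer, and the four residual stages
# ("convolution layers 1-4") whose intermediate features are kept for fusion.
backbone = resnet50()  # pretrained weights can be loaded as desired

init_layer = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
max_pool = backbone.maxpool
conv_layers = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]

def cnn_encoder(x):
    """Return the intermediate feature of each residual stage (Cv_1..Cv_4)."""
    x = max_pool(init_layer(x))
    features = []
    for stage in conv_layers:
        x = stage(x)
        features.append(x)   # forwarded to the feature fusion modules
    return features
```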
3. Transformer encoder branch
Step S3: the transform encoder branch encodes the input image.
Specifically, in step 3, before the image enters the Transformer encoder branch it is first divided into a number of image blocks, flattened into an image sequence, and then subjected to one linear projection. The Transformer encoder branch consists mainly of Transformer blocks and patch merging layers. As shown in FIG. 2, the Transformer block is composed of residual structures formed by the two parts proposed by the invention: the Haar-transform-based high-low frequency attention mechanism and the locally enhanced feed-forward network. The Transformer block of each layer is executed three times in succession. The patch merging layer merges every 4 adjacent blocks into one, yielding a feature map of the same scale as the corresponding layer of the convolutional neural network.
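The patch embedding and patch merging operations described here admit a compact sketch; the patch size and embedding dimension below are assumptions for illustration, not values from the patent.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into patches, flatten to a sequence, apply one linear
    projection (step 3). A strided convolution is the usual equivalent of
    'split + flatten + linear'."""
    def __init__(self, patch_size=4, in_ch=3, dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim) image sequence

class PatchMerging(nn.Module):
    """Merge every 2x2 group of adjacent patches into one, so each layer
    matches the scale of the corresponding CNN-branch layer."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, h, w):                # x: (B, h*w, dim)
        b, _, c = x.shape
        x = x.view(b, h, w, c)
        # Gather the 4 adjacent patches and concatenate along channels.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduction(x.view(b, -1, 4 * c))
```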
Step 4: in each Transformer block, the feature map passes in turn through the high-low frequency attention mechanism, the first addition and layer normalization, the locally enhanced feed-forward network, and the second addition and layer normalization.
As shown in FIG. 2, the Transformer block comprises, in series, a high-low frequency attention mechanism, a first addition and layer normalization, a locally enhanced feed-forward network, and a second addition and layer normalization; here the full-convolution high-low frequency attention module and the feed-forward network of the original Transformer block are replaced by the new high-low frequency attention mechanism and the locally enhanced feed-forward network.
Specifically, in step 4, as shown in FIG. 3, the high-low frequency attention mechanism first applies the standard decomposition of the Haar wavelet transform: the rows are filtered, then the columns, and a first-level decomposition yields three detail components and one approximation component. The image can be regarded as a discrete function taking values in [0, 1] and can then be represented as a sum of scale functions multiplied by their coefficients; the scale function is

$$\phi(x)=\begin{cases}1, & 0\le x<1\\ 0, & \text{otherwise.}\end{cases}$$

The wavelet function $\psi(x)$ can be represented by the scale function, i.e. $\psi(x)=\phi(2x)-\phi(2x-1)$, and the first-order wavelet decomposition can then be expressed as

$$f(x)=\sum_k c_k\,\phi(x-k)+\sum_k d_k\,\psi(x-k),$$

where $C_0=\{c_k\}$ is the approximation component required by the invention. $C_0$ is then upsampled to the original image size, and the approximation component is subtracted from the original image to obtain the required detail component. For the obtained approximation component and detail component, a set of queries (Q), keys (K) and values (V) is obtained by linear transformation, respectively, while the query of the approximation component is taken from the input so that the size does not change after self-attention is computed. The two different sets of Q, K and V are then processed by scaled dot-product attention:

$$\mathrm{SA}_h=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{D_h}}\right)V,$$

where $\mathrm{SA}_h$ is the single-head scaled dot-product self-attention result and $D_h$ is the hidden-layer dimension; the results of the several heads are concatenated and linearly transformed once to obtain the multi-head self-attention result. To reduce computation, the feature dimension is divided according to a ratio $\alpha$ while the approximation and detail components are obtained: when the feature dimension is $N_h$, the approximation component receives $\alpha N_h$ and the detail component $(1-\alpha)N_h$, so that the multi-head self-attention fuses both high-frequency and low-frequency characteristics. The attention is also computed within windows, each window being computed separately. The resulting high-frequency and low-frequency features are then concatenated in the channel direction as the output, which passes with the input through the first addition and layer normalization residual structure.
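A simplified sketch of this high-low frequency attention is given below. The Haar first-level approximation is realized as a 2x2 average (low-pass filtering of rows then columns, up to normalization), the detail component as the input minus the upsampled approximation, and the heads are split by the ratio alpha; window partitioning and some projections are omitted for brevity, so this approximates the mechanism rather than reproducing the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiLoWaveletAttention(nn.Module):
    def __init__(self, dim, num_heads=8, alpha=0.5):
        super().__init__()
        self.nl = max(1, int(alpha * num_heads))   # heads for the approximation branch
        self.nh = max(1, num_heads - self.nl)      # heads for the detail branch
        self.dl = dim * self.nl // num_heads       # alpha*N_h share of the channels
        self.dh = dim - self.dl                    # (1-alpha)*N_h share
        self.low = nn.MultiheadAttention(self.dl, self.nl, batch_first=True)
        self.high = nn.MultiheadAttention(self.dh, self.nh, batch_first=True)
        self.proj = nn.Linear(dim, dim)            # linear transform after concatenation

    def forward(self, x, h, w):                    # x: (B, h*w, C)
        b, _, c = x.shape
        img = x.transpose(1, 2).reshape(b, c, h, w)
        # Haar first-level approximation C_0: 2x2 average (rows then columns).
        c0 = F.avg_pool2d(img, 2)
        # Detail component: upsample C_0 to the original size and subtract.
        detail = img - F.interpolate(c0, size=(h, w), mode="nearest")

        xl, xh = torch.split(x, (self.dl, self.dh), dim=-1)
        c0_seq = c0.flatten(2).transpose(1, 2)[..., :self.dl]
        d_seq = detail.flatten(2).transpose(1, 2)[..., self.dl:]

        # Low-frequency branch: queries from the input, keys/values from C_0.
        lo, _ = self.low(xl, c0_seq, c0_seq)
        # High-frequency branch: scaled dot-product attention on the detail.
        hi, _ = self.high(d_seq, d_seq, d_seq)
        # Concatenate high- and low-frequency features along the channel direction.
        return self.proj(torch.cat([lo, hi], dim=-1))
```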
In step 4, the result of the first addition and layer normalization is then projected by the locally enhanced feed-forward network into a high-dimensional space, to learn more abstract features that are not well captured in the low-dimensional space, before being projected back into the original space.
Specifically, as shown in FIG. 4, the input sequence is first spatially recovered into image dimensions; the dimension is then raised by a depthwise separable convolution and GELU activation; in the high-dimensional space, local features are captured by another depthwise separable convolution and GELU activation, so that interaction arises between blocks and the expressive power of the network is enhanced. The original dimension is then recovered by an inverted depthwise separable convolution with GELU activation and batch normalization, and the result is flattened back into a sequence, followed by the second addition and layer normalization residual structure with the input as the output.
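A sketch of the locally enhanced feed-forward network under this description follows; the kernel sizes and expansion ratio are assumptions.

```python
import torch.nn as nn

class LocallyEnhancedFFN(nn.Module):
    """Restore the sequence to image shape, raise the dimension with a
    depthwise separable conv + GELU, capture local features in the
    high-dimensional space with another depthwise separable conv + GELU,
    project back with an inverted depthwise separable conv + BN + GELU,
    and flatten back into a sequence (step 4)."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        # depthwise separable convolution = depthwise conv + pointwise conv
        self.up = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # depthwise
            nn.Conv2d(dim, hidden, 1),                       # pointwise, raises dim
            nn.GELU())
        self.local = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.Conv2d(hidden, hidden, 1),
            nn.GELU())
        self.down = nn.Sequential(                           # inverted order: pointwise first
            nn.Conv2d(hidden, dim, 1),                       # pointwise, restores dim
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # depthwise
            nn.BatchNorm2d(dim),
            nn.GELU())

    def forward(self, x, h, w):                   # x: (B, h*w, C)
        b, _, c = x.shape
        img = x.transpose(1, 2).reshape(b, c, h, w)    # spatial recovery
        img = self.down(self.local(self.up(img)))
        return img.flatten(2).transpose(1, 2)          # flatten back to a sequence
```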
4. Feature fusion module
Step 5: and fusing the characteristics of each middle layer of the convolutional neural network coder branch and the transducer coder branch by a unified input characteristic fusion module.
Specifically, in step 5, the feature $Cv_i$ extracted by the i-th layer of the convolutional neural network encoder and the feature $T_i$ extracted by the i-th layer of the Transformer encoder are inconsistent in the channel dimension. Therefore, as shown in FIG. 5, the module first receives the intermediate features $Cv_i$ and $T_i$ input by the dual encoder, where $Cv_i$ is the intermediate feature of the i-th layer of the convolutional neural network encoder branch and $T_i$ that of the i-th layer of the Transformer encoder branch, and uses a 1 x 1 convolution to adjust the channel dimension of $Cv_i$. Channel attention then adjusts the response of each channel of the two features, raising the weight of favorable channels and reducing the influence of redundant ones; this series of processing yields $\widehat{Cv}_i$ and $\widehat{T}_i$. Next, $\widehat{Cv}_i$ and $\widehat{T}_i$ are matrix-multiplied to enhance their correlation, giving the correlation-enhancement feature. Furthermore, $\widehat{Cv}_i$ and $\widehat{T}_i$ also enter a cross-domain fusion module to fuse information between the different domains, as shown in FIG. 6: the two different features are linearly transformed to obtain two sets of queries, keys and values; each type of query then computes multi-head attention with the keys and values of the other type; finally the two results are concatenated in the channel direction, and a 1 x 1 convolution reduces the dimension and extracts the effective information, giving the cross-domain fusion feature. Finally, the correlation-enhancement feature, the cross-domain fusion feature, $\widehat{Cv}_i$ and $\widehat{T}_i$, four in all, are concatenated; the dimensionality is reduced by a feature fusion block, namely an inverted depthwise separable convolution with batch normalization and GELU activation, and the multiple feature types are fully fused and the effective information extracted by a depthwise separable convolution with batch normalization and GELU activation followed by a 1 x 1 convolution with batch normalization and GELU activation. The fused feature $F_i$ of the layer is output.
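The fusion module can be sketched in simplified form as follows. The channel attention is squeeze-and-excitation style, the cross-domain fusion block is reduced to a 1 x 1 convolution stand-in for its mutual multi-head attention, and the exact composition of the feature fusion block is an assumption based on the description above.

```python
import torch
import torch.nn as nn

class ComplementaryFusion(nn.Module):
    def __init__(self, cnn_ch, dim):
        super().__init__()
        self.align = nn.Conv2d(cnn_ch, dim, 1)          # 1x1 conv aligns channel dims
        self.ca_cv = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                   nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.ca_t = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        # Stand-in for the cross-domain fusion block (mutual attention omitted).
        self.cross = nn.Conv2d(2 * dim, dim, 1)
        # Feature fusion block: reduce the 4 concatenated features and mix them.
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * dim, dim, 1), nn.BatchNorm2d(dim), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.GELU())

    def forward(self, cv, t):
        cv_hat = self.align(cv)
        cv_hat = cv_hat * self.ca_cv(cv_hat)            # suppress redundant channels
        t_hat = t * self.ca_t(t)
        b, d, h, w = t_hat.shape
        # Correlation enhancement: matrix product of the two flattened features.
        corr = torch.matmul(cv_hat.flatten(2), t_hat.flatten(2).transpose(1, 2))
        corr_feat = torch.matmul(corr.softmax(-1), t_hat.flatten(2)).reshape(b, d, h, w)
        cross = self.cross(torch.cat([cv_hat, t_hat], dim=1))
        # Concatenate the four features and reduce with the feature fusion block.
        return self.fuse(torch.cat([cv_hat, t_hat, corr_feat, cross], dim=1))
```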
5. Decoder
Step 6: the decoder consists of three decoder layers in series and one segmentation head. Each decoder layer consists, along the channel direction, of a sub-pixel convolution, an inverted depthwise separable convolution and a Transformer block; the corresponding feature fusion module is input between the sub-pixel convolution and the inverted depthwise separable convolution of the decoder layer, and each Transformer block in the decoder layers is executed only once.
The fused feature obtained by the last layer of the encoder serves as the initial input to the decoder. Each decoder layer first upsamples through the sub-pixel convolution, then concatenates the fused feature of the corresponding scale, reduces the dimension through the inverted depthwise separable convolution, and then aggregates and decodes the features with the Transformer block. After the three decoder layers, the segmentation head is entered: the feature is first upsampled by a factor of 4 to the original size by a sub-pixel convolution and then passed through a 1 x 1 convolution to obtain the final segmentation mask.
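A sketch of one decoder layer and the segmentation head as described; channel counts are illustrative, and the Transformer block is injected as a placeholder (e.g. `nn.Identity()` for a dry run).

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sub-pixel convolution upsampling, concatenation with the fused feature
    of the matching scale, dimension reduction by an inverted depthwise
    separable convolution, then one pass of the Transformer block (step 6)."""
    def __init__(self, in_ch, out_ch, transformer_block):
        super().__init__()
        self.upsample = nn.Sequential(               # sub-pixel convolution
            nn.Conv2d(in_ch, out_ch * 4, 1), nn.PixelShuffle(2))
        self.reduce = nn.Sequential(                 # inverted depthwise separable conv
            nn.Conv2d(out_ch * 2, out_ch, 1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch))
        self.block = transformer_block               # executed once in the decoder

    def forward(self, x, skip):
        x = self.upsample(x)
        x = self.reduce(torch.cat([x, skip], dim=1))
        return self.block(x)

class SegmentationHead(nn.Module):
    """Sub-pixel convolution upsampling by a factor of 4, then a 1x1 conv."""
    def __init__(self, in_ch, n_classes=1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * 16, 1), nn.PixelShuffle(4),
            nn.Conv2d(in_ch, n_classes, 1))

    def forward(self, x):
        return self.head(x)
```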
6. Environment setting
Step 8: the loss function in the model training phase uses the sum of binary cross entropy BCE loss and Dice loss, i.e., BCE plus Dice:
wherein N represents the total number of pixels in the image, t i Then it is the true class of pixel, p i Then it is the network predicted pixel class and epsilon is the smoothing factor to prevent zero removal.
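For reference, this combined loss can be written as a short PyTorch sketch; the value of the smoothing factor is an assumption.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1.0):
    """Sum of binary cross-entropy and Dice loss as in step 8;
    eps is the smoothing factor that prevents division by zero."""
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
    return bce + dice
```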
The batch size of the training stage is set to 24, and 70 epochs are trained in total. The optimizer is AdamW with an initial learning rate of 0.001, and the scheduler uses a cosine annealing strategy. The whole model is implemented with Python 3.10 and PyTorch 1.11.0 on an Nvidia RTX 3090 graphics card with 24 GB of memory and an Intel(R) Xeon(R) Platinum 8350C CPU.
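This optimizer and scheduler configuration translates directly into PyTorch; the model below is only a placeholder for the dual-encoder network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 1)   # placeholder for the dual-encoder network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # initial LR 0.001
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=70)

for epoch in range(70):      # 70 epochs, batch size 24 per the patent
    # ... one training epoch over the DataLoader goes here ...
    scheduler.step()
```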
It is noted that step 8 is not an essential step of the encoder and decoder; it only makes the trained network more robust. Steps 2 and 3 can be performed in parallel.
As can be seen from steps 1-8 above, to adapt to crack detection in complex environments and reduce computation, the invention uses a convolutional neural network and a Transformer together to construct a new dual-encoder-path network with fewer multiply-accumulate operations and a higher frame rate. The Transformer encoder branch extracts the semantic information of the image more comprehensively, helping raise the recall of the network and the detection rate of tiny cracks. The widely used ResNet-50 serves as the convolutional neural network encoder branch to extract the local detail information of the image and extracts wide, obvious crack information well. The whole network downsamples layer by layer from top to bottom, extracting information at different scales. The feature fusion module, with its better complementary behavior, is embedded between the layers to aggregate the features extracted by the two encoder branches. The fused feature of the last layer is sent to the decoder for decoding, and the fused features of the other layers are sent to the corresponding decoder layers through skip connections, aiding the recovery of image information.
To illustrate the beneficial effects of the method, the advantages of the dual encoder structure and crack image segmentation method of the invention are described in further detail below in connection with the experimental results of FIGS. 7-9 and Tables 1-7 and with Example 1:
example 1
Example 1 was developed entirely with the PyTorch framework; the relevant configuration is as follows: the batch size of the training stage was set to 24, and 70 epochs were trained in total. The optimizer was AdamW with an initial learning rate of 0.001, and the scheduler used a cosine annealing strategy. The whole model was implemented with Python 3.10 and PyTorch 1.11.0 on an Nvidia RTX 3090 graphics card with 24 GB of memory and an Intel(R) Xeon(R) Platinum 8350C CPU.
First, to determine the optimal number of executions of the Transformer block per encoder layer, a comparative experiment was performed on the DeepCrack dataset, comparing one, two and three executions; the results are shown in Table 1.
TABLE 1 Comparison of different numbers of Transformer block executions

| Executions | Precision | Recall | F1 score | Mean IoU |
|---|---|---|---|---|
| ×1 | 0.8317 | 0.8945 | 0.8620 | 0.7459 |
| ×2 | 0.8339 | 0.9081 | 0.8694 | 0.7415 |
| ×3 | 0.8287 | 0.9270 | 0.8751 | 0.7523 |
In addition, different versions of ResNet were selected to form the convolutional neural network encoder branch; ResNet-50 was found to have the best comprehensive performance, as shown in Table 2.
TABLE 2 Comparison of different convolutional neural network encoder branches

| Model | Precision | Recall | F1 score | Mean IoU |
|---|---|---|---|---|
| ResNet-18 | 0.6853 | 0.9178 | 0.7847 | 0.6641 |
| ResNet-34 | 0.7878 | 0.9008 | 0.8405 | 0.7077 |
| ResNet-50 | 0.8287 | 0.9270 | 0.8751 | 0.7523 |
| ResNet-101 | 0.8926 | 0.8258 | 0.8579 | 0.7284 |
| ResNet-152 | 0.8537 | 0.8772 | 0.8653 | 0.7310 |
The invention also verifies the effectiveness of the dual-path encoder by removing one of the encoder paths. Experiments show that the dual-path encoder structure effectively improves the recall of the network and markedly improves overall performance; the results are shown in Table 3.
TABLE 3 Effectiveness comparison of the convolutional neural network encoder and the Transformer encoder
To better illustrate the effectiveness of the newly designed feature fusion module, ablation experiments were performed on its main basic units, as shown in Table 4. The network performs best when all three main operations are present, and channel attention in particular significantly raises the recall of the network. The cross-domain fusion block carries a relatively large computational cost: when this module is removed, the computation and execution speed of the network improve markedly.
TABLE 4 Effectiveness comparison of the constituent units of the feature fusion module

| Channel attention | Correlation enhancement | Cross-domain fusion block | Precision | Recall | F1 score | Mean IoU | FPS ↑ |
|---|---|---|---|---|---|---|---|
|  |  |  | 0.8786 | 0.7849 | 0.8291 | 0.6863 | 38.29 |
|  |  |  | 0.8439 | 0.8666 | 0.8551 | 0.7264 | 35.76 |
|  |  |  | 0.8496 | 0.8320 | 0.8407 | 0.7076 | 36.61 |
|  |  |  | 0.8686 | 0.8033 | 0.8347 | 0.7027 | 31.02 |
|  |  |  | 0.8260 | 0.8667 | 0.8459 | 0.7227 | 32.68 |
|  |  |  | 0.8213 | 0.8835 | 0.8513 | 0.7240 | 30.67 |
|  |  |  | 0.8748 | 0.8409 | 0.8575 | 0.7318 | 35.64 |
| ✓ | ✓ | ✓ | 0.8287 | 0.9270 | 0.8751 | 0.7523 | 30.02 |
To verify the advancement of the invention, comparisons with ten current advanced methods on the two public datasets show that the overall performance of the method is superior to the other methods; the results are given in Tables 5 and 6.
TABLE 5 Comparison of models on the DeepCrack dataset

| Model | Precision | Recall | F1 score | Mean IoU |
|---|---|---|---|---|
| UNet-ResNet34 | 0.8568 | 0.7912 | 0.8227 | 0.6727 |
| Attn-UNet | 0.8612 | 0.7975 | 0.8282 | 0.6874 |
| DeepLabv3+ | 0.8893 | 0.7241 | 0.7982 | 0.6818 |
| CENet | 0.9024 | 0.7519 | 0.8203 | 0.6754 |
| TransFuse | 0.9253 | 0.6624 | 0.7721 | 0.6076 |
| UTNet | 0.8930 | 0.8239 | 0.8570 | 0.7387 |
| FAT-Net | 0.8857 | 0.8052 | 0.8436 | 0.7106 |
| DeepCrack | 0.8842 | 0.8236 | 0.8528 | 0.7191 |
| DcsNet | 0.9104 | 0.6961 | 0.7889 | 0.6287 |
| DTrC-Net | 0.9403 | 0.6651 | 0.7791 | 0.6324 |
| The invention | 0.8287 | 0.9270 | 0.8751 | 0.7523 |
TABLE 6 Comparison of models on the Crack3238 dataset

| Model | Precision | Recall | F1 score | Mean IoU |
|---|---|---|---|---|
| UNet-ResNet34 | 0.7232 | 0.7314 | 0.7121 | 0.5878 |
| Attn-UNet | 0.7555 | 0.7260 | 0.7278 | 0.6039 |
| DeepLabv3+ | 0.6783 | 0.7874 | 0.7155 | 0.5837 |
| CENet | 0.7475 | 0.7470 | 0.7332 | 0.6061 |
| TransFuse | 0.7318 | 0.7256 | 0.7112 | 0.5826 |
| UTNet | 0.7409 | 0.7561 | 0.7368 | 0.6149 |
| FAT-Net | 0.7507 | 0.7433 | 0.7347 | 0.6131 |
| DeepCrack | 0.7197 | 0.7372 | 0.7138 | 0.5870 |
| DcsNet | 0.7397 | 0.7488 | 0.7307 | 0.6073 |
| DTrC-Net | 0.7560 | 0.7886 | 0.7644 | 0.6430 |
| The invention | 0.7826 | 0.7902 | 0.7864 | 0.5869 |
To demonstrate the practicality and efficiency of the method, the invention was compared with the other models on computational cost. Although the method is not optimal in this comparison, it sits at a mid-tier level while its segmentation performance is far superior to the other models; the results are shown in Table 7. Its frame rate also meets practical application requirements.
TABLE 7 Comparison of the efficiency of the models

| Model | MACs (G) ↓ | Params (M) ↓ | FPS ↑ |
|---|---|---|---|
| UNet-ResNet34 | 90.13 | 69.30 | 41.07 |
| Attn-UNet | 135.34 | 57.16 | 28.94 |
| DeepLabv3+ | 22.14 | 59.23 | 30.64 |
| CENet | 8.90 | 29.00 | 32.96 |
| TransFuse | 63.40 | 143.39 | 18.52 |
| UTNet | 5.19 | 3.62 | 31.59 |
| FAT-Net | 42.80 | 29.62 | 28.00 |
| DeepCrack | 136.80 | 30.91 | 27.35 |
| DcsNet | 15.79 | 25.85 | 38.19 |
| DTrC-Net | 123.20 | 63.45 | 28.78 |
| The invention | 26.08 | 68.86 | 30.02 |
In addition, as FIGS. 7-9 show, the crack image segmentation method and structure based on dual encoders in complex environments perform well on datasets from each complex environment; the features of the dual encoder complement each other well, the computational characteristics are favorable, the universality is strong, and the method applies well to crack image segmentation in a variety of environments.
It should also be noted that the crack image segmentation method of the invention can be converted into software program instructions and implemented either by running a software analysis system comprising a processor and a memory, or through computer instructions stored in a non-transitory computer-readable storage medium.
Finally, the above is only a preferred embodiment of the method of the invention and is not intended to limit its scope. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in its scope of protection.

Claims (9)

1. A crack image segmentation method based on dual encoders in a complex environment, characterized in that the crack image segmentation method improves parts (a)-(b) of the original overall framework of the road crack segmentation method based on a convolutional neural network and Transformer dual path of patent application CN202310525413.0:
(a) Dual encoder
The dual encoder comprises a convolutional neural network encoder branch and a Transformer encoder branch. In the Transformer encoder branch, the four original Transformer blocks are replaced with the new Transformer blocks. Before being input to the Transformer encoder branch, the input image is first divided into a number of image blocks and flattened into an image sequence; after one linear projection, the sequence is fed into the Transformer encoder branch structure formed by the four Transformer blocks and three patch merging layers, and the Transformer block of each layer in the dual encoder is executed three times in succession;
(b) Decoder
All three Transformer blocks in the decoder are likewise replaced with the new Transformer blocks, and the Transformer block of each layer in the decoder is executed once;
the Transformer block comprises, connected in series, a high-low frequency attention mechanism, a first addition and layer normalization, a locally enhanced feed-forward network, and a second addition and layer normalization;
in the high-low frequency attention mechanism, the standard decomposition of the Haar wavelet transform is adopted: the rows are filtered, then the columns, and a first-level decomposition yields three detail components and one approximation component. The image can be regarded as a discrete function taking values in [0, 1] and can then be represented as a sum of scale functions multiplied by their coefficients, the scale function being $\phi(x)=1$ for $0\le x<1$ and $\phi(x)=0$ otherwise. The wavelet function $\psi(x)$ can be represented by the scale function, i.e. $\psi(x)=\phi(2x)-\phi(2x-1)$, and the first-order wavelet decomposition is then written $f(x)=\sum_k c_k\,\phi(x-k)+\sum_k d_k\,\psi(x-k)$, where $C_0=\{c_k\}$ is the required approximation component. $C_0$ is then upsampled to the original image size, and the approximation component is subtracted from the original image to obtain the required detail component. For the approximation component and detail component, a set of queries Q, keys K and values V is obtained by linear transformation, respectively, the query of the approximation component being taken from the input; the two different sets of Q, K and V are then processed by scaled dot-product attention to obtain the single-head scaled dot-product self-attention result $\mathrm{SA}_h$, and the results of the several heads are concatenated and linearly transformed once to obtain the multi-head self-attention result. When the feature dimension is $N_h$, the approximation component is allotted $\alpha N_h$ of it and the detail component $(1-\alpha)N_h$; the high-frequency and low-frequency features thus obtained are concatenated in the channel direction as the output, which then passes with the input through the first addition and layer normalization residual structure.
2. The dual-encoder-based crack image segmentation method in a complex environment according to claim 1, characterized in that in the convolutional neural network encoder branch of the dual encoder, the initialization layer, the max-pooling layer and the four convolution layers 1-4 of the ResNet-50 network are extracted; the input image passes in turn through the initialization layer and the max-pooling layer, and local features of the image are extracted as it passes through the four convolution layers 1-4; the intermediate feature of each of convolution layers 1-4 is retained, output to the four feature fusion modules, and fused with the corresponding intermediate feature of the Transformer encoder branch.
3. The dual-encoder-based crack image segmentation method in a complex environment according to claim 1, characterized in that during data loading the input image is cropped and augmented, the image undergoing in turn resizing, random horizontal flipping, random vertical flipping, random rotation and center cropping; in the test stage the image is only cropped to the specified input size, and the three channels of the image are regularized.
4. The dual-encoder-based crack image segmentation method in a complex environment according to claim 2, characterized in that in the Transformer block, the output of the first addition and layer normalization is projected by the locally enhanced feed-forward network into a high-dimensional space, to learn more abstract features that are not well captured in the low-dimensional space, and then projected back into the original space before the second addition and layer normalization;
in the locally enhanced feed-forward network, the input sequence is first spatially recovered into image dimensions; the dimension is then raised by a depthwise separable convolution and GELU activation; in the high-dimensional space, local features are captured by another depthwise separable convolution and GELU activation, so that interaction arises between blocks and the expressive power of the network is enhanced; the original dimension is then recovered by an inverted depthwise separable convolution with GELU activation and batch normalization, and the result is flattened into a sequence, followed by the second addition and layer normalization residual structure with the input as the output.
5. The dual-encoder-based crack image segmentation method in a complex environment according to claim 4, characterized in that each feature fusion module operates as follows:
first, an intermediate feature Cv of a dual encoder input is received i And T i ,Cv i Is an intermediate feature of the ith layer of the branch of the convolutional neural network coder, T i Is an intermediate feature of the ith layer of the transducer encoder branch, and uses a convolution of 1 x 1 size to adjust Cv i Is a channel dimension of (2); then, the channel attention is adopted to adjust the response of each channel of the two characteristics, the weight of the favorable channel is improved, the influence of the redundant channel is reduced, and the obtained productAnd->Then (I)>And->Matrix multiplication is carried out, and the correlation of the matrix multiplication and the matrix multiplication is enhanced, so that correlation enhancement characteristics are obtained; furthermore, the->And->A cross domain fusion module is also input to fuse information among different domains, and two different features are respectively subjected to linear transformation to obtain two groups of inquiry, key and value; calculating multi-head attention of each type of inquiry and keys and values of the other type, connecting two results in the channel direction, reducing the dimension by convolution with the size of 1 multiplied by 1, extracting effective information, and obtaining cross domain fusion characteristics; connecting said relevance enhancing feature, cross domain fusion feature,/->And->Fourth, the dimension is reduced by inverting the depth separable convolution, batch normalization and GELU activation functions, the depth separable convolution, batch normalization and GELU activation functions and the 1×1 convolution, batch normalization and GELU activation functions are fully fused to obtain multiple types of features, effective information is extracted, and finally the fused features F of the layer are output i
6. The dual-encoder-based crack image segmentation method in a complex environment according to claim 1, characterized in that the decoder consists of three decoder layers in series and a segmentation head; each decoder layer consists, along the channel direction, of a sub-pixel convolution, an inverted depthwise separable convolution and a Transformer block; the corresponding feature fusion module is input between the sub-pixel convolution and the inverted depthwise separable convolution of the decoder layer; and the segmentation head consists of a sub-pixel convolution and a 1 x 1 convolution.
7. The dual-encoder-based crack image segmentation method in a complex environment according to claim 1, characterized in that the loss function of the crack image segmentation method in the model training stage adopts the sum of the binary cross-entropy (BCE) loss and the Dice loss, i.e. BCE plus Dice:

$$L = L_{\mathrm{BCE}} + L_{\mathrm{Dice}}, \qquad
L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[t_i\log p_i + (1-t_i)\log(1-p_i)\bigr], \qquad
L_{\mathrm{Dice}} = 1-\frac{2\sum_{i=1}^{N} t_i p_i + \varepsilon}{\sum_{i=1}^{N} t_i + \sum_{i=1}^{N} p_i + \varepsilon},$$

where N is the total number of pixels in the image, $t_i$ the true class of a pixel, $p_i$ the pixel class predicted by the network, and $\varepsilon$ the smoothing factor.
8. A crack image segmentation system based on dual encoders in a complex environment, characterized by comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the dual-encoder-based crack image segmentation method in a complex environment according to any one of claims 1-7.
9. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the dual-encoder-based crack image segmentation method in a complex environment according to any one of claims 1-7.
CN202311029215.1A 2023-08-16 2023-08-16 Crack image segmentation method based on double encoders in complex environment Pending CN117058382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029215.1A CN117058382A (en) 2023-08-16 2023-08-16 Crack image segmentation method based on double encoders in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311029215.1A CN117058382A (en) 2023-08-16 2023-08-16 Crack image segmentation method based on double encoders in complex environment

Publications (1)

Publication Number Publication Date
CN117058382A true CN117058382A (en) 2023-11-14

Family

ID=88656736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029215.1A Pending CN117058382A (en) 2023-08-16 2023-08-16 Crack image segmentation method based on double encoders in complex environment

Country Status (1)

Country Link
CN (1) CN117058382A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372306A (en) * 2023-11-23 2024-01-09 山东省人工智能研究院 Pulmonary medical image enhancement method based on double encoders
CN117372306B (en) * 2023-11-23 2024-03-01 山东省人工智能研究院 Pulmonary medical image enhancement method based on double encoders
CN117333777A (en) * 2023-12-01 2024-01-02 山东元明晴技术有限公司 Dam anomaly identification method, device and storage medium
CN117333777B (en) * 2023-12-01 2024-02-13 山东元明晴技术有限公司 Dam anomaly identification method, device and storage medium
CN117746227A (en) * 2024-02-19 2024-03-22 吉林大学 Underwater target fine-granularity classification method for infrared polarization imaging data
CN117746227B (en) * 2024-02-19 2024-06-11 吉林大学 Underwater target fine-granularity classification method for infrared polarization imaging data
CN118014844A (en) * 2024-04-09 2024-05-10 临沂大学 Remote sensing image semantic segmentation method combined with super-resolution technology

Similar Documents

Publication Publication Date Title
CN117058382A (en) Crack image segmentation method based on double encoders in complex environment
CN111681252B (en) Medical image automatic segmentation method based on multipath attention fusion
CN110599401A (en) Remote sensing image super-resolution reconstruction method, processing device and readable storage medium
CN114758383A (en) Expression recognition method based on attention modulation context spatial information
CN101630405B (en) Multi-focusing image fusion method utilizing core Fisher classification and redundant wavelet transformation
CN110738663A (en) Double-domain adaptive module pyramid network and unsupervised domain adaptive image segmentation method
Wang et al. Computer vision-based road crack detection using an improved I-UNet convolutional networks
CN114187520B (en) Building extraction model construction and application method
CN116912257B (en) Concrete pavement crack identification method based on deep learning and storage medium
CN114693577B (en) Infrared polarized image fusion method based on Transformer
CN115937693A (en) Road identification method and system based on remote sensing image
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN114359102A (en) Image depth restoration evidence obtaining method based on attention mechanism and edge guide
CN113850284B (en) Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction
CN113192076B (en) MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction
CN117422691A (en) Concrete crack detection method and system combining linear guide and grid optimization
CN115830054A (en) Crack image segmentation method based on multi-window high-low frequency visual converter
CN113780301B (en) Self-adaptive denoising machine learning application method for defending against attack
CN107818325A (en) Image sparse representation method based on integrated dictionary learning
CN113313721A (en) Real-time semantic segmentation method based on multi-scale structure
CN113255646A (en) Real-time scene text detection method
Chu et al. Similarity based filter pruning for efficient super-resolution models
CN113361417B (en) Human behavior recognition method based on variable time sequence
Duan Dual flow fusion model for concrete surface crack segmentation
CN118097155B (en) Remote sensing image segmentation method and system based on global and local feature cooperation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination