CN117333750A - Spatial registration and local global multi-scale multi-modal medical image fusion method - Google Patents


Info

Publication number
CN117333750A
Authority
CN
China
Prior art keywords
image
scale
fusion
network
feature
Prior art date
Legal status
Pending
Application number
CN202311513745.3A
Other languages
Chinese (zh)
Inventor
王丽芳
郭威
王晋光
靳凯欣
韩强
Current Assignee
North University of China
Original Assignee
North University of China
Priority date
Filing date
Publication date
Application filed by North University of China
Priority to CN202311513745.3A
Publication of CN117333750A
Status: Pending


Classifications

    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 7/33: Determination of transform parameters for the alignment of images, i.e. image registration, using feature-based methods


Abstract

The invention belongs to the field of medical image fusion, and specifically relates to a spatial registration and local-global multi-scale multi-modal medical image fusion method. The method uses a spatial registration network to register the images, eliminating the influence of image distribution differences on the registration result; a multi-scale subtraction CNN branch and an MPViT branch then extract the local and global multi-scale features of the images, and the extracted local and global multi-scale features are used for adaptive image fusion. The results show that the spatial registration network performs well in the registration process and effectively eliminates the influence of image distribution differences on the registration result; the local multi-scale features provided by the multi-scale subtraction CNN preserve the brightness and detail information of the image, the global multi-scale features provided by MPViT preserve the edge information of the image, and their combination yields a fused image with well-preserved structure and rich texture detail.

Description

Spatial registration and local global multi-scale multi-modal medical image fusion method
Technical Field
The invention belongs to the field of medical image fusion, and specifically relates to a spatial registration and local-global multi-scale multi-modal medical image fusion method.
Background
With the development of medical imaging technology, medical images play an increasingly important role in clinical diagnosis and treatment. Different types of medical images, such as CT (computed tomography), MRI (magnetic resonance imaging) and PET (positron emission tomography), each have their own advantages and disadvantages. Because the information provided by a single modality is limited and insufficient for a doctor to correctly diagnose a condition, doctors often need to examine image information from different modalities, which is labor-intensive and inefficient. Multi-modal medical image fusion merges the information of medical images of different modalities into one image, compensating for the limited characterization of single-modality information and improving the efficiency and accuracy of diagnosis.
Existing multi-modal image fusion methods operate on already registered images. At present, multi-modal image registration comprises traditional registration methods and deep-learning-based registration methods. Traditional registration methods require continuous iterative optimization, so registration is slow and computationally expensive, making them unsuitable for large-scale registration tasks; they are also sensitive to the quality of the extracted key points, so their robustness is poor. Current mainstream deep-learning-based registration methods do not consider the influence of the image distribution differences between modalities on the registration result. Because of these distribution differences, registration accuracy is reduced and it is difficult to align the registered image and the target image completely in space.
Existing fusion methods fall into two main categories: traditional fusion methods and deep-learning-based fusion methods. Traditional fusion methods focus mainly on the design of the fusion strategy and include methods based on multi-scale transforms, sparse representation, subspaces, saliency, and hybrid approaches. These methods rely on manually designed fusion algorithms and strategies; for complex and diverse fusion tasks, a dedicated algorithm and strategy cannot be designed for every task, so fusion efficiency is low. To address these problems, deep-learning-based image fusion methods have emerged, which perform feature extraction, feature fusion and image reconstruction through a network model. For example, DenseNetFuse proposes a densely connected network structure in which the encoding part connects dense blocks via residual dense connections for feature extraction, but this structure can only extract features at a single scale. Although CNN-based fusion methods improve generalization by learning local features and overcome the drawbacks of traditional methods, the limited receptive field of a CNN means that it may lose global context information that is useful for fusing images. SwinFuse adopts a pure Transformer structure to extract image features, but this causes loss of texture detail; TransMEF proposes a multi-exposure image fusion framework in which a CNN is connected in parallel with Transformer blocks, using the CNN and the Transformer to extract local and global information of the image simultaneously, but this structure only performs single-scale feature extraction.
In summary, existing image registration methods are inefficient and ignore the influence of image distribution differences on the registration result, while in image fusion the global context is insufficiently characterized and single-scale feature extraction is limited. The invention therefore provides a spatial registration and local-global multi-scale multi-modal medical image fusion method, which mainly comprises: 1) training the registration network with a spatial evaluator so that the registration network eliminates the influence of image distribution differences on the registration result; 2) designing an encoder network that extracts the local-global multi-scale features of the image; 3) combining the image registration task with the image fusion task to improve image fusion efficiency.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a spatial registration and local-global multi-scale multi-modal medical image fusion method.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
the invention provides a spatial registration and local-global multi-scale multi-modal medical image fusion method. The method adopts a spatial registration network and a fusion network, and specifically comprises the following steps:
Step 1: in the spatial registration network, concatenating the unregistered floating image M and the target image T along the channel dimension and inputting them into the trained deformation field network R to obtain the predicted deformation field ϕ_T2M;
step 2: warping the floating image M according to the predicted deformation field ϕ_T2M to obtain the registered image M∘ϕ_T2M, which is input into the fusion network;
step 3: in the fusion network, passing the registered image M∘ϕ_T2M and the target image T each through the same encoder in the fusion network for local-global multi-scale feature extraction;
step 4: aggregating the features extracted in step 3 from the two images respectively, then fusing them according to an adaptive fusion strategy to obtain the fused features, and performing image reconstruction with a decoder in the fusion network from the fused features to obtain the fused image.
The decoder of the present invention adopts a structure comprising three identically structured convolutional layers and one convolutional layer with a 1×1 kernel. Each of the three identically structured convolutional layers comprises two 3×3 convolution blocks and two ReLU activation functions. Without changing the spatial size of the feature map, the features delivered by the adaptive fusion layer are used as the decoder input and gradually restored to generate the fused image.
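As a concrete illustration of the decoder just described, the following is a minimal PyTorch sketch with three identically structured convolutional layers (each two 3×3 convolution blocks with ReLU activations) followed by a 1×1 convolution; the channel schedule and input width are illustrative assumptions, not values fixed by the invention.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, in_channels=256, out_channels=1):
        super().__init__()
        chans = [in_channels, 128, 64, 32]   # assumed channel schedule
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            ]
        self.blocks = nn.Sequential(*layers)
        # final 1x1 convolution maps the remaining features to the fused image
        self.head = nn.Conv2d(chans[-1], out_channels, kernel_size=1)

    def forward(self, fused_features):
        # the spatial size is never changed; only the channel count is reduced step by step
        return self.head(self.blocks(fused_features))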
Further, the training of the spatial registration network in step 1 is as follows:
first, the target image T and the floating image M are concatenated along the channel dimension and input into the deformation field network R for feature extraction, obtaining the deformation field ϕ_T2M based on the target image T and the deformation field ϕ_M2T based on the floating image M, formulated as:
u_T = R_θ(T, M)    (1)
u_M = R_θ(M, T)    (2)
ϕ_T2M = Id + u_T    (3)
ϕ_M2T = Id + u_M    (4)
where u_T and u_M denote the deformation fields learned with respect to the target image T and the floating image M, respectively; Id denotes the identity deformation field; θ denotes the parameters of the deformation field network; the registration network is optimized by minimizing the image similarity loss and by minimizing the regularization constraint on the deformation field;
then, a spatial transformer network applies a displacement transformation to each pixel of the target image T and of the floating image M according to the deformation fields ϕ_M2T and ϕ_T2M, respectively, yielding the registered images T∘ϕ_M2T and M∘ϕ_T2M; finally, the registered image T∘ϕ_M2T together with the floating image M, and the registered image M∘ϕ_T2M together with the target image T, are fed into the trained spatial evaluator to obtain the spatial errors e1 and e2; the spatial errors e1 and e2 are minimized by equation (5) while the parameters of the spatial registration network are optimized according to equation (6), which minimizes the spatial regularization constraint on the deformation field.
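For illustration only, the following is a minimal PyTorch sketch of this training step: the predicted deformation field is applied with a grid-sampling spatial transformer, and the spatial errors returned by the evaluator are minimized together with a smoothness regularizer on the deformation field. The displacement-channel order, the evaluator interface and all function names are assumptions.

import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp `image` (N,C,H,W) with a dense displacement field `flow` (N,2,H,W), channels assumed (x, y)."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # identity grid Id
    new_locs = grid.unsqueeze(0) + flow                            # Id + u
    # normalize sampling locations to [-1, 1] for grid_sample
    new_locs[:, 0] = 2 * new_locs[:, 0] / (w - 1) - 1
    new_locs[:, 1] = 2 * new_locs[:, 1] / (h - 1) - 1
    return F.grid_sample(image, new_locs.permute(0, 2, 3, 1), align_corners=True)

def smoothness(flow):
    # spatial regularization of the deformation field (gradient penalty)
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dx + dy

def registration_loss(evaluator, T, M, u_T, u_M, lam=1.0):
    M_reg = warp(M, u_T)              # M o phi_T2M
    T_reg = warp(T, u_M)              # T o phi_M2T
    e1 = evaluator(T_reg, M).mean()   # spatial error between T o phi_M2T and M
    e2 = evaluator(M_reg, T).mean()   # spatial error between M o phi_T2M and T
    return e1 + e2 + lam * (smoothness(u_T) + smoothness(u_M))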
Further, the structure of the spatial evaluator in step 2 is as follows: a 7×7 initial convolution block (with padding around the input image) is applied first, then two 3×3 convolutions perform shallow feature extraction, and 9 sequentially stacked residual blocks perform deeper feature extraction so that more spatial information is retained; finally, after two 3×3 upsampling stages, a convolution layer with a 7×7 kernel and a Tanh activation function convert the feature vector into a scalar that represents the spatial difference between the two images.
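A minimal PyTorch sketch of a spatial evaluator with this layout is given below; channel widths, the use of stride-2 convolutions before the residual blocks and the channel-wise concatenation of the two input images are assumptions.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class SpatialEvaluator(nn.Module):
    def __init__(self, in_channels=2, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReflectionPad2d(3), nn.Conv2d(in_channels, base, 7),              # initial 7x7 block with padding
            nn.Conv2d(base, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # two 3x3 convolutions,
            nn.Conv2d(base, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # shallow feature extraction
            *[ResBlock(base) for _ in range(9)],                                  # 9 stacked residual blocks
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(3), nn.Conv2d(base, 1, 7), nn.Tanh(),              # 7x7 convolution + Tanh
        )

    def forward(self, img_a, img_b):
        x = torch.cat([img_a, img_b], dim=1)
        return self.net(x).mean(dim=[1, 2, 3])   # one scalar spatial difference per image pair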
Further, the structure of the fusion network in step 3 is as follows: an encoder-decoder structure is adopted; the encoder consists of a multi-scale subtraction CNN branch and an MPViT branch, which extract the local and global multi-scale features of the image, respectively;
the multi-scale subtraction CNN branch uses multi-scale subtraction units to compute the difference features between adjacent scales and fuses these difference features to obtain complementary features containing the difference information between different scales;
the MPViT branch adopts a multi-scale patch and multi-path structure with 4 stages in total, each stage consisting of a multi-scale patch module and a multi-path Transformer module.
Further, the structure of the multi-scale subtraction CNN branch is as follows: Res2Net-50 is adopted as the backbone network to extract features at different scales, and the feature at each scale is passed through a 3×3 convolution to reduce the number of channels to 64 so as to reduce the number of parameters of subsequent operations; several multi-scale subtraction units are added and connected in the horizontal and vertical directions to compute a series of cross-scale complementary features of different orders; the feature of the corresponding scale and the cross-scale complementary features are added element by element to obtain the complementary enhancement feature CE_i, where i denotes the corresponding scale and n denotes the different orders.
Further, the multi-scale subtraction unit applies fixed-size multi-scale convolution filters to the two feature maps F_A and F_B to extract features, and element-wise subtraction is applied to the two features extracted by the same-scale convolution filter to obtain the detail and structure differences between adjacent-scale features, where i denotes the different kernel sizes; finally, the obtained scale detail and structure differences are aggregated to obtain the complementary features of adjacent scales.
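The following is a minimal PyTorch sketch of one such multi-scale subtraction unit; the 1×1/3×3/5×5 kernel sizes follow the later description, while the aggregation by summation and a fusing convolution is an assumption.

import torch
import torch.nn as nn

class MultiScaleSubtractionUnit(nn.Module):
    def __init__(self, channels=64, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.filters = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_a, f_b):
        # per-scale detail/structure differences between the two feature maps
        diffs = [(conv(f_a) - conv(f_b)).abs() for conv in self.filters]
        # aggregate the differences into one complementary feature of the adjacent scales
        return self.fuse(sum(diffs))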
Further, the MPViT branch has 4 stages. In each stage, the same feature is first embedded into patches of different scales simultaneously, then the patch sequences of different scales are independently fed into Transformer encoders through multiple paths for global feature extraction, and the global features under the patch sequences of different scales are aggregated, so that feature representations of different scales are obtained at the same feature level;
the structure of the multi-scale patch module is as follows: in the first stage, given the output feature X_i from the Conv-stem, a convolution F_{k×k}(·) with kernel size k maps the input X_i to a new token feature F_{k×k}(X_i), where k is the convolution kernel size, s is the stride and p is the padding;
given the output feature X_i of the previous stage, the multi-scale patch module produces features with different patch scales but the same feature-map size, which are then mapped by a reshape operation into feature sequences of length H_i × W_i with C_i channels and delivered to the multi-path Transformer module for global feature extraction and aggregation;
the multi-path Transformer module adds a corresponding Transformer branch for the patches of each scale, and an additional local convolution branch is added to the multi-path Transformer module to better capture the structural information and local relations of each patch; after feature extraction, the global features of the Transformer branches are reshaped back into feature maps and concatenated with the local features of the convolution branch to obtain the aggregated feature A_i, and a 1×1 convolution outputs the final feature X_{i+1} as the input of the next stage;
the Transformer branch consists of two normalization layers, a multi-head self-attention layer, a feed-forward network layer and an average pooling layer, and the local convolution branch consists of two 1×1 convolutions, a 3×3 depthwise convolution and a residual connection.
Further, the fusion network is trained with a loss function that is the sum of a mean squared error loss, a structural similarity loss and a total variation loss;
the mean squared error loss computes the square of the difference between corresponding pixel values of the reconstructed image and the source image and averages it, measuring the difference between the generated image and the target image and ensuring pixel-level reconstruction:
L_mse = (1 / (H × W)) Σ_{i,j} (I_out(i, j) - I_in(i, j))^2    (10)
where I_out is the output reconstructed image, I_in is the input source image, i and j are the horizontal and vertical pixel coordinates, and H and W are the image height and width;
the structural similarity loss trains the network by comparing the structural similarity between the reconstructed image and the source image:
L_ssim = 1 - SSIM(I_out, I_in)    (11)
where the SSIM function measures the structural similarity between the reconstructed image and the source image;
the total variation loss is used to penalize spatial transformations in the image:
R(i, j) = I_out(i, j) - I_in(i, j)    (12)
L_TV = Σ_{i,j} (‖R(i, j+1) - R(i, j)‖2 + ‖R(i+1, j) - R(i, j)‖2)    (13)
where R(i, j) is the difference between the source image and the reconstructed image and ‖·‖2 is the L2 norm;
the total loss function is:
L = L_mse + λ1 L_ssim + λ2 L_TV    (14)
where λ1 and λ2 are the weight coefficients of the structural similarity loss and the total variation loss in the total loss, respectively.
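A minimal PyTorch sketch of this combined loss is given below, assuming image intensities in [0, 1] and delegating SSIM to an external routine (here the pytorch_msssim package); the λ values are illustrative.

import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed dependency providing an SSIM implementation

def fusion_training_loss(i_out, i_in, lambda1=1.0, lambda2=0.1):
    l_mse = F.mse_loss(i_out, i_in)                       # pixel-level reconstruction, eq. (10)
    l_ssim = 1.0 - ssim(i_out, i_in, data_range=1.0)      # structural similarity loss, eq. (11)
    r = i_out - i_in                                      # R(i, j), eq. (12)
    # total variation of R: 2-norm of the per-pixel horizontal and vertical differences, eq. (13)
    l_tv = (r[:, :, :, 1:] - r[:, :, :, :-1]).abs().sum() \
         + (r[:, :, 1:, :] - r[:, :, :-1, :]).abs().sum()
    return l_mse + lambda1 * l_ssim + lambda2 * l_tv      # eq. (14)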
Further, the adaptive fusion strategy is as follows:
first, GL_K^M(i, j) (K ∈ {A, B}) denotes the energy value of the feature map at position (i, j), where K indexes the feature map of a particular image and M is the number of channels of the feature map; the strategy applied at each position is determined by setting a threshold T. The specific rules are:
1) if the energies of the two images at the corresponding position are both smaller than the threshold T, the maximum-value strategy is adopted to obtain the fused feature F(i, j);
2) if the energies of the two images at the corresponding position are larger than the threshold T, an L1-norm rule is adopted to obtain the fused feature F(i, j),
where C_K(i, j) denotes the initial activity level, ‖·‖1 denotes the L1 norm, and W_A and W_B denote the weights at the corresponding positions of the two feature maps.
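For illustration, the following is a minimal PyTorch sketch of such an energy-driven rule; the local-window energy computation, the threshold value and the soft L1-norm weighting are assumptions consistent with, but not dictated by, the description above.

import torch
import torch.nn.functional as F

def adaptive_fuse(feat_a, feat_b, threshold=0.5, window=3):
    # initial activity level C_K(i, j): L1 norm over channels, smoothed over a local window
    c_a = F.avg_pool2d(feat_a.abs().sum(dim=1, keepdim=True), window, stride=1, padding=window // 2)
    c_b = F.avg_pool2d(feat_b.abs().sum(dim=1, keepdim=True), window, stride=1, padding=window // 2)

    # rule 1: both energies below T -> element-wise maximum
    fused_max = torch.maximum(feat_a, feat_b)

    # rule 2: energies above T -> L1-norm based soft weights W_A, W_B
    w_a = c_a / (c_a + c_b + 1e-8)
    w_b = c_b / (c_a + c_b + 1e-8)
    fused_l1 = w_a * feat_a + w_b * feat_b

    low_energy = (c_a < threshold) & (c_b < threshold)
    return torch.where(low_energy, fused_max, fused_l1)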
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a spatial registration and local-global multi-scale multi-modal medical image fusion method, which uses a spatial registration network to register the images and eliminate the influence of image distribution differences on the registration result, then uses a multi-scale subtraction CNN branch and an MPViT branch to extract the local and global multi-scale features of the images, and performs adaptive image fusion with the extracted features. Experimental results show that the spatial registration network performs well in the registration process and effectively eliminates the influence of image distribution differences on the registration result; the local multi-scale features provided by the multi-scale subtraction CNN preserve the brightness and detail information of the image, the global multi-scale features provided by MPViT preserve the edge information of the image, and their combination yields a fused image with well-preserved structure and rich texture detail.
Drawings
FIG. 1 is a full frame diagram of the present invention;
FIG. 2 is a spatial registration network training diagram;
FIG. 3 is a block diagram of a spatial evaluator;
FIG. 4 is a spatial evaluator training flow chart;
FIG. 5 is a diagram of a multi-scale subtractive CNN branch structure;
FIG. 6 is a structural diagram of the MPViT branch;
FIG. 7 is a diagram of a multi-scale subtraction unit;
FIG. 8 is a block diagram of a multi-scale patch module;
FIG. 9 is a block diagram of the multi-path Transformer module;
fig. 10 (a) is a Transformer block structure diagram, and fig. 10 (b) is a local convolution block structure diagram;
FIG. 11 is a registration fusion process of the present invention;
FIG. 12 is a comparison of registration results;
FIG. 13 is a graph showing comparison of fusion results.
Detailed Description
The present invention will be described more fully hereinafter in order to facilitate an understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Example 1
As shown in fig. 1, this embodiment provides a spatial registration and local-global multi-scale multi-modal medical image fusion method, which adopts a spatial registration network and a fusion network; the complete framework is shown in fig. 1, and the method specifically includes the following steps:
Step 1: in the spatial registration network, the unregistered floating image M and the target image T are concatenated along the channel dimension and input into the trained deformation field network R to obtain the predicted deformation field ϕ_T2M;
the training process of the spatial registration network is as follows:
in the spatial registration network, a deformation field network R with a U-Net-like structure and residual connections is used to generate the deformation field between the two images. The training process is shown in fig. 2. First, the target image T and the floating image M are concatenated along the channel dimension and input into the deformation field network R for feature extraction, obtaining the deformation field ϕ_T2M based on the target image T and the deformation field ϕ_M2T based on the floating image M, formulated as:
u_T = R_θ(T, M)    (1)
u_M = R_θ(M, T)    (2)
ϕ_T2M = Id + u_T    (3)
ϕ_M2T = Id + u_M    (4)
where u_T and u_M denote the deformation fields learned with respect to the target image T and the floating image M, respectively; Id denotes the identity deformation field; θ denotes the parameters of the deformation field network; the registration network is optimized by minimizing the image similarity loss and by minimizing the regularization constraint on the deformation field;
then, a spatial transformer network applies a displacement transformation to each pixel of the target image T and of the floating image M according to the deformation fields ϕ_M2T and ϕ_T2M, respectively, yielding the registered images T∘ϕ_M2T and M∘ϕ_T2M; finally, the registered image T∘ϕ_M2T together with the floating image M, and the registered image M∘ϕ_T2M together with the target image T, are fed into the trained spatial evaluator to obtain the spatial errors e1 and e2; the spatial errors e1 and e2 are minimized by equation (5) while the parameters of the spatial registration network are optimized according to equation (6), which minimizes the spatial regularization constraint on the deformation field. The registration network therefore performs well in both the T-to-M and the M-to-T registration task.
The registered images eliminate the spatial errors between the original modalities, thereby avoiding the distortion, image blurring and similar problems that could otherwise appear in the fusion stage.
The structure of the spatial evaluator is shown in fig. 3: a 7×7 initial convolution block (with padding around the input image) is applied first, then two 3×3 convolutions perform shallow feature extraction, and 9 sequentially stacked residual blocks perform deeper feature extraction so that more spatial information is preserved; finally, after two 3×3 upsampling stages, a convolution layer with a 7×7 kernel and a Tanh activation function convert the feature vector into a scalar that represents the spatial difference between the two images.
The spatial evaluator is mainly used to compute the spatial difference between the registered image and the source image. During training of the registration network, the trained spatial evaluator is used for back-propagation, which eliminates the influence of image distribution differences on the registration result during registration and improves registration accuracy.
The training flow of the spatial evaluator is shown in fig. 4. Two random spatial transforms T1 and T2 are applied to the original input image I to generate the transformed images I1 and I2, respectively. Since I1 and I2 come from the same original image, they differ only in spatial position. The spatial error between I1 and I2 is used as the ground-truth label for training the spatial evaluator. Random noise ε is added to I1 or I2 to create a distribution difference between them. I1 + ε and I2 are input into the spatial evaluator to obtain a prediction error. Finally, by minimizing the mean absolute error (MAE) between the prediction error and the actual spatial error, the spatial evaluator is made to focus more on the spatial differences of the images, further improving its prediction accuracy. The loss is the mean absolute error between the predicted and the actual spatial error.
the trained spatial estimator is used for calculating the spatial difference between the registered image and the target image, so that the registration network is optimized, and the registration effect is improved.
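A minimal PyTorch sketch of this training flow is given below; the random affine transform, the noise level and the definition of the ground-truth spatial error are illustrative assumptions.

import torch
import torch.nn.functional as F
import torchvision.transforms as T

def train_spatial_evaluator(evaluator, loader, optimizer, epochs=10, noise_std=0.05):
    random_affine = T.RandomAffine(degrees=10, translate=(0.05, 0.05))
    for _ in range(epochs):
        for img in loader:                                         # img: (N, 1, H, W) source images
            i1, i2 = random_affine(img), random_affine(img)        # T1(I), T2(I)
            true_error = (i1 - i2).abs().mean(dim=[1, 2, 3])       # assumed ground-truth spatial error
            noisy_i1 = i1 + noise_std * torch.randn_like(i1)       # add a distribution difference
            pred_error = evaluator(noisy_i1, i2)                   # predicted spatial error
            loss = F.l1_loss(pred_error, true_error)               # MAE between prediction and label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()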
Step 2: the floating image M is warped according to the predicted deformation field ϕ_T2M to obtain the registered image M∘ϕ_T2M, which is input into the fusion network;
Step 3: in the fusion network, the registered image M∘ϕ_T2M and the target image T are each passed through the same encoder in the fusion network for local-global multi-scale feature extraction. The fusion network adopts an encoder-decoder structure; the encoder consists of a multi-scale subtraction CNN branch and an MPViT branch, whose structures are shown in fig. 5 and fig. 6, respectively, and which extract the local and global multi-scale features of the image, respectively;
the multi-scale subtraction CNN branch uses multi-scale subtraction units to compute the difference features between adjacent scales and fuses these difference features to obtain complementary features containing the difference information between different scales;
the MPViT branch adopts a multi-scale patch and multi-path structure with 4 stages in total, each stage consisting of a multi-scale patch module and a multi-path Transformer module.
The multi-scale subtraction CNN computes the difference features between adjacent scales with multi-scale subtraction units and fuses these difference features to obtain complementary features containing the difference information between different scales. The structure of the multi-scale subtraction CNN branch is as follows: Res2Net-50 is adopted as the backbone network to extract features at different scales, and the feature at each scale is passed through a 3×3 convolution to reduce the number of channels to 64 so as to reduce the number of parameters of subsequent operations; several multi-scale subtraction units are added and connected in the horizontal and vertical directions to compute a series of cross-scale complementary features of different orders; the feature of the corresponding scale and the cross-scale complementary features are added element by element to obtain the complementary enhancement feature CE_i, where i denotes the corresponding scale and n denotes the different orders.
CE_5 is upsampled to the same size as CE_4 and fused with CE_4 by element-wise addition, yielding a fused feature CF_{5-4} containing the scale information of Stage 4 and Stage 5. CF_{5-4} is then upsampled to the same size as CE_3 and fused with CE_3, yielding a fused feature CF_{5-3} containing the scale information of Stage 5 through Stage 3. Following this fusion flow, through repeated upsampling and fusion, the local feature CF containing rich texture information is finally obtained for fusion with the global features. During multi-scale feature fusion, the difference information between features at different scales is computed by the subtraction units, yielding complementary enhancement features at different scales and improving the richness of the local features.
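The cross-scale aggregation described above can be sketched as follows; channel alignment between stages is omitted and is an assumption.

import torch.nn.functional as F

def aggregate_local_features(ce_feats):
    """ce_feats: [CE1, ..., CE5] ordered from the shallowest to the deepest scale."""
    cf = ce_feats[-1]                                     # start from CE5
    for ce in reversed(ce_feats[:-1]):                    # CE4, CE3, CE2, CE1
        cf = F.interpolate(cf, size=ce.shape[-2:], mode="bilinear", align_corners=False)
        cf = cf + ce                                      # element-wise fusion, e.g. CF_{5-4}
    return cf                                             # local feature CF with rich texture information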
Further, the structure of the multi-scale subtraction unit is shown in fig. 7. Three fixed-size multi-scale convolution filters (1×1, 3×3, 5×5) extract features from the two feature maps F_A and F_B, and element-wise subtraction is applied to the two features extracted by the same-scale convolution filter to obtain the detail and structure differences between adjacent-scale features, where i denotes the different kernel sizes; finally, the obtained scale detail and structure differences are aggregated to obtain the complementary features of adjacent scales. In the multi-scale subtraction unit formula, ⊖ denotes element-wise subtraction and Filter denotes the convolution filter.
The Transformer has a self-attention mechanism, which allows it to model the dependency between any two positions and therefore gives it strong global perception. To enable the encoder to obtain detailed multi-scale global context information, MPViT is used as one branch of the encoder for global multi-scale feature extraction. The MPViT branch comprises 4 stages. In each stage, the same feature is first embedded into patches of different scales simultaneously, then the patch sequences of different scales are independently fed into Transformer encoders through multiple paths for global feature extraction, and the global features under the patch sequences of different scales are aggregated, so that feature representations of different scales are obtained at the same feature level. Compared with the traditional Vision Transformer, the multi-scale patch and multi-path Transformer structure achieves a more diverse representation of multi-scale features, allowing the fused image to retain more global context information.
The architecture of the multi-scale patch module is shown in fig. 8. It realizes patch representations of fine and coarse granularity at the same feature level by using different convolution kernel sizes. In the first stage, given the output feature X_i from the Conv-stem, a convolution F_{k×k}(·) with kernel size k maps the input X_i to a new token feature F_{k×k}(X_i), where k is the convolution kernel size, s is the stride and p is the padding. The length of the token sequence at each patch scale is adjusted by adjusting the padding and stride of the convolutions, so features of the same size but different patch scales are constructed with convolutions of different kernel sizes (3×3, 5×5, 7×7). Since successive convolution operations with the same number of channels and kernel size enlarge the receptive field, two successive 3×3 convolution layers are used instead of the 5×5 convolution and three successive 3×3 convolution layers instead of the 7×7 convolution in practice. To reduce model parameters and computational overhead, the convolutions are 3×3 depthwise separable convolutions, each comprising a 3×3 depthwise convolution and a 1×1 pointwise convolution. The channel size is C', the padding is 1 and the stride is s, where s is 2 when reducing the spatial resolution and 1 otherwise. Given the output feature X_i of the previous stage, the multi-scale patch module produces the features F_{3×3}(X_i), F_{5×5}(X_i), F_{7×7}(X_i) with different patch scales but the same feature-map size, which are then mapped by a reshape operation into feature sequences of length H_i × W_i with C_i channels and delivered to the multi-path Transformer module for global feature extraction and aggregation;
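A minimal PyTorch sketch of such a multi-scale patch module is given below, with two and three stacked 3×3 depthwise separable convolutions standing in for the 5×5 and 7×7 kernels; exact channel widths and strides are assumptions.

import torch.nn as nn

def dw_separable(c, stride=1):
    return nn.Sequential(
        nn.Conv2d(c, c, 3, stride=stride, padding=1, groups=c),   # 3x3 depthwise convolution
        nn.Conv2d(c, c, 1),                                       # 1x1 pointwise convolution
    )

class MultiScalePatchEmbed(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.path3 = dw_separable(channels, stride)                                            # ~3x3 patch
        self.path5 = nn.Sequential(dw_separable(channels, stride), dw_separable(channels))     # ~5x5 patch
        self.path7 = nn.Sequential(dw_separable(channels, stride), dw_separable(channels),
                                   dw_separable(channels))                                     # ~7x7 patch

    def forward(self, x):
        feats = [self.path3(x), self.path5(x), self.path7(x)]
        # reshape each (N, C, H, W) map into a token sequence of length H*W with C channels
        tokens = [f.flatten(2).transpose(1, 2) for f in feats]
        return feats, tokens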
the multi-path Transformer module adds a corresponding Transformer branch for the patches of each scale, and an additional local convolution branch is added to the multi-path Transformer module to better capture the structural information and local relations of each patch. The global features of the Transformer branches are reshaped back into feature maps and concatenated with the local features of the convolution branch to obtain the aggregated feature A_i, and a 1×1 convolution outputs the final feature X_{i+1} as the input of the next stage. An efficient factorized self-attention mechanism is adopted to reduce the computational burden of the multi-path structure. Meanwhile, in the Transformer feature-mapping stage, the multiple CLS tokens used in multi-token weakly supervised semantic segmentation replace global average pooling to obtain more spatial structure information. The Transformer branch structure is shown in fig. 10 (a), in which the Transformer block is the Transformer branch. The Transformer branch consists of two normalization layers, a multi-head self-attention layer, a feed-forward network layer and an average pooling layer.
While the self-attention mechanism in the Transformer can capture long-range dependencies (i.e., global context), it may ignore the structural information and local relations within each patch. In contrast, a CNN, with its better local connectivity and translational invariance, can capture the structural information and local relations of each patch; therefore, in the multi-path Transformer module, a convolution branch is added to represent local features. The local convolution block structure is shown in fig. 10 (b), in which the local convolution block is the local convolution branch; it consists of two 1×1 convolutions, a 3×3 depthwise convolution and a residual connection, where the depthwise convolution is used to reduce computation.
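For illustration, the following minimal PyTorch sketch combines per-scale Transformer branches with a local convolution branch and a 1×1 aggregation, consuming feature maps and token sequences such as those produced by the patch-module sketch above; standard multi-head self-attention is used in place of the factorized attention and CLS-token details, which is a simplification.

import torch
import torch.nn as nn

class LocalConvBranch(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 1),
            nn.Conv2d(c, c, 3, padding=1, groups=c),   # 3x3 depthwise convolution
            nn.Conv2d(c, c, 1),
        )

    def forward(self, x):
        return x + self.body(x)                        # residual connection

class MultiPathTransformerBlock(nn.Module):
    def __init__(self, channels, num_paths=3, heads=4):
        super().__init__()
        self.transformers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                       dim_feedforward=channels * 4,
                                       batch_first=True, norm_first=True)
            for _ in range(num_paths)
        )
        self.local = LocalConvBranch(channels)
        self.aggregate = nn.Conv2d(channels * (num_paths + 1), channels, 1)

    def forward(self, feats, tokens):
        n, c, h, w = feats[0].shape
        outs = [self.local(feats[0])]                              # local convolution branch (one map assumed)
        for tok, enc in zip(tokens, self.transformers):            # one Transformer branch per patch scale
            g = enc(tok).transpose(1, 2).reshape(n, c, h, w)       # 2-D reshape of the global tokens
            outs.append(g)
        return self.aggregate(torch.cat(outs, dim=1))              # aggregated feature A_i -> X_{i+1}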
Step 4: the features extracted in step 3 from the two images are aggregated respectively and then fused according to the adaptive fusion strategy to obtain the fused features, and the decoder in the fusion network performs image reconstruction from the fused features to obtain the fused image.
The fusion network is trained with a loss function that is the sum of a mean squared error loss, a structural similarity loss and a total variation loss;
the mean squared error loss computes the square of the difference between corresponding pixel values of the reconstructed image and the source image and averages it, measuring the difference between the generated image and the target image and ensuring pixel-level reconstruction:
L_mse = (1 / (H × W)) Σ_{i,j} (I_out(i, j) - I_in(i, j))^2    (12)
where I_out is the output reconstructed image, I_in is the input source image, i and j are the horizontal and vertical pixel coordinates, and H and W are the image height and width;
the structural similarity loss trains the network by comparing the structural similarity between the reconstructed image and the source image:
L_ssim = 1 - SSIM(I_out, I_in)    (13)
where the SSIM function measures the structural similarity between the reconstructed image and the source image;
the total variation loss is used to penalize spatial transformations in the image:
R(i, j) = I_out(i, j) - I_in(i, j)    (14)
L_TV = Σ_{i,j} (‖R(i, j+1) - R(i, j)‖2 + ‖R(i+1, j) - R(i, j)‖2)    (15)
where R(i, j) is the difference between the source image and the reconstructed image and ‖·‖2 is the L2 norm;
the total loss function is:
L = L_mse + λ1 L_ssim + λ2 L_TV    (16)
where λ1 and λ2 are the weight coefficients of the structural similarity loss and the total variation loss in the total loss, respectively.
The invention adopts an adaptive fusion strategy, which can adaptively select the appropriate fusion rule according to the energy contained at each position of the feature map, better preserving important feature information.
First, GL_K^M(i, j) (K ∈ {A, B}) denotes the energy value of the feature map at position (i, j), where K indexes the feature map of a particular image and M is the number of channels of the feature map. The strategy applied at each position is determined by setting a threshold T. The specific rules are:
1) if the energies of the two images at the corresponding position are both smaller than the threshold T, the maximum-value strategy is adopted to obtain the fused feature F(i, j);
2) if the energies of the two images at the corresponding position are larger than the threshold T, an L1-norm rule is adopted to obtain the fused feature F(i, j),
where C_K(i, j) denotes the initial activity level, ‖·‖1 denotes the L1 norm, and W_A and W_B denote the weights at the corresponding positions of the two feature maps.
After the features of the two images are obtained through the encoder, the energy value at each position on the corresponding channels is computed according to the above fusion rules, the features are fused according to these energy values by selecting either the maximum-value strategy or the L1 norm to obtain the final fused features, and the decoder adjusts the number of channels of the fused features step by step to obtain the final fused image.
Images of different modalities are spatially aligned by the spatial registration network, feature extraction is performed by the encoder with local-global multi-scale feature extraction capability, the image features extracted by the encoder are adaptively fused, and the fused image is finally reconstructed by the decoder. The registration-fusion flow of the invention is shown in fig. 11.
The fusion method of the invention is divided into a training stage and a testing stage, and the specific process is as follows:
(1) Training phase
The training of the fusion network of the invention, i.e., encoder-decoder training, is shown in Algorithm 1. In each iteration a batch of medical images is input; the images pass through the multi-scale subtraction CNN and MPViT branches of the encoder for local and global multi-scale feature extraction, yielding the feature f_lg; the extracted feature f_lg is input into the decoder for image reconstruction, yielding the reconstructed image img_d; and over multiple training rounds the encoder and decoder parameters are optimized according to the mean squared error loss L_mse, the structural similarity loss L_ssim and the total variation loss L_TV between the reconstructed image and the original input image.
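A minimal sketch of this training loop (Algorithm 1) is given below; module and function names are illustrative, and the loss corresponds to the combined MSE + SSIM + TV loss defined earlier.

def train_fusion_network(encoder, decoder, loader, optimizer, epochs, loss_fn):
    encoder.train(); decoder.train()
    for _ in range(epochs):
        for img in loader:                     # a batch of medical images
            f_lg = encoder(img)                # local + global multi-scale features
            img_d = decoder(f_lg)              # reconstructed image
            loss = loss_fn(img_d, img)         # L_mse + λ1·L_ssim + λ2·L_TV
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()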
(2) Test phase
To verify the reliability and stability of the model in practical applications, a verification test is performed on the network model; the test process is Algorithm 2. An unregistered pair of medical images T and M is input; the trained registration network produces the registered pair M∘ϕ_T2M and T (or, symmetrically, M and the warped target image); the registered images are input into the trained encoder for feature extraction, yielding the features GL_{M∘ϕT2M} and GL_T; the extracted features are fused by the adaptive fusion module to obtain the fused features; and the fused features are reconstructed by the trained decoder to obtain the fused image.
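A minimal sketch of the test-time pipeline (Algorithm 2) is given below; warp and adaptive_fuse refer to the sketches above, the channel-concatenation order for the registration network input is an assumption, and all names are illustrative.

import torch

@torch.no_grad()
def fuse_pair(reg_net, encoder, decoder, T, M):
    flow = reg_net(torch.cat([M, T], dim=1))       # predicted deformation field phi_T2M
    M_reg = warp(M, flow)                          # registered image M o phi_T2M
    feat_m = encoder(M_reg)                        # GL_{M o phi_T2M}
    feat_t = encoder(T)                            # GL_T
    fused = adaptive_fuse(feat_m, feat_t)          # adaptive fusion of the extracted features
    return decoder(fused)                          # reconstructed fused image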
Example 2
To verify the registration and fusion effect of the invention, image registration based on local neighbourhood feature descriptors (MIND), VoxelMorph-based medical image registration and RegGAN-based multi-modal image registration are selected as registration comparison methods. In the fusion stage, the convolutional-neural-network-based image fusion method U2Fusion, the convolutional-autoencoder-based image fusion method SeAFusion, the residual-dense-network-based image fusion method UMF-CMGR, the UNet-architecture-based image fusion method DPE-MEF and the CNN-Transformer-parallel image fusion method TransMEF are selected as fusion comparison methods.
(1) Registration contrast
CT-to-MR registration and MR-to-CT registration are tested with the registration method of the invention and the three comparison registration methods; the registration results are shown in fig. 12. Registration accuracy is judged by overlaying the registered image on the target image, adjusting the transparency and observing the degree of overlap of the two images. In the CT-to-MR registration test, the image registered by the MIND method shows a large degree of distortion, and the overlap comparison between the registered image and the target image shows that the image contours cannot be aligned well; by comparison, VoxelMorph and RegGAN show smaller registration errors, with only partial misalignment at the image edges. In contrast, the method of the invention performs well in handling image distortion and in edge alignment accuracy. In the MR-to-CT registration test, the three methods MIND, VoxelMorph and RegGAN all show misalignment at the image edges, with VoxelMorph showing the most obvious edge misalignment.
To objectively evaluate the registration method of the invention against the three comparison registration methods, 5 objective evaluation metrics are selected in the experiment: the Dice coefficient, HD95, root mean square error (RMSE), mean absolute error (MAE) and mutual information (MI). The Dice coefficient is a statistical measure of the similarity or overlap of two sets; the larger its value, the higher the degree of overlap. HD95 is the 95th percentile of the Hausdorff distance, which measures the similarity between two point sets; the smaller the HD95 value, the more similar the two images. RMSE and MAE measure the difference between observed and true values; smaller values indicate smaller differences. MI measures the degree of information overlap between two images; the higher the mutual information value, the better the registration quality.
The mean values of the objective evaluation metrics for the registration method of the invention and the three comparison registration methods are shown in table 1:
TABLE 1. Mean values of the objective registration evaluation metrics
As can be seen from table 1, the registration method of the invention outperforms the other three registration methods on all five objective evaluation metrics.
(2) Fusion comparison
The fusion results of the method of the invention and the five comparison methods are shown in fig. 13, from which differences between the fusion methods can be observed. U2Fusion uses only multi-scale convolution for feature extraction, so its fusion result shows a certain degree of blurring and information loss at the image edges. SeAFusion only uses residual connections to strengthen the network's feature extraction ability and ignores feature information at different scales, resulting in poor contrast and unclear detail in the fusion result. UMF-CMGR uses a feature extraction approach similar to SeAFusion, resulting in some artifacts and noise in the fusion result. The fusion result of DPE-MEF retains the detail information of the image, but the image brightness is low and focal information is not well highlighted. Because TransMEF only uses single-scale convolution and Transformer blocks to extract features, the local and global information in the extracted features is insufficient, and the edge smoothness and brightness of the fusion result are not well maintained. In contrast, the fusion method of the invention preserves both local and global multi-scale information well, so the fusion result not only retains the detail information of the different modalities but also performs excellently in brightness preservation; the visual effect of the whole image is the best, which can better help doctors judge the condition accurately.
To objectively evaluate the fusion effect of the invention against the five comparison methods, 7 objective evaluation metrics are selected in the experiment: standard deviation (SD), spatial frequency (SF), mutual information (MI), the sum of the correlations of differences (SCD), visual information fidelity (VIFF), gradient-based fusion performance (QAB/F) and structural similarity (SSIM). SD measures the degree of variation of pixel values in the image; a higher SD value means larger grey-level variation, indicating richer detail and contrast. SF indicates the rate of change in the image; higher SF values indicate that the image contains more edge detail information. MI measures the amount of information shared between the fused image and the source images; a higher MI value indicates that the fused image is more correlated with the source images and retains more common information. SCD uses the difference between one source image and the fused image to describe the information of the other source image contained in the fused image; the larger the positive SCD value, the higher the correlation between the fused image and the source images. VIFF measures the information consistency between the fused image and the source images; a higher VIFF value indicates that the fused image better preserves the visual information of the original images. QAB/F represents the quality of the visual information obtained from the input images; higher QAB/F values represent better fused-image quality. SSIM measures the structural and textural similarity between the fused image and the source images; a higher SSIM value indicates that the fused image retains more structural information of the source images.
The mean values of the objective evaluation metrics for the fusion effect of the invention and the five comparison methods are shown in table 2:
TABLE 2. Mean values of the objective fusion evaluation metrics
As can be seen from table 2, the fusion method of the invention outperforms the five comparison methods on six of the metrics and remains competitive with the other methods on the SF metric.

Claims (9)

1. A spatial registration and local-global multi-scale multi-modal medical image fusion method, characterized by adopting a spatial registration network and a fusion network and comprising the following steps:
step 1: in the spatial registration network, concatenating the unregistered floating image M and the target image T along the channel dimension and inputting them into the trained deformation field network R to obtain the predicted deformation field ϕ_T2M;
step 2: warping the floating image M according to the predicted deformation field ϕ_T2M to obtain the registered image M∘ϕ_T2M, which is input into the fusion network;
step 3: in the fusion network, passing the registered image M∘ϕ_T2M and the target image T each through the same encoder in the fusion network for local-global multi-scale feature extraction;
step 4: aggregating the features extracted in step 3 from the two images respectively, then fusing them according to an adaptive fusion strategy to obtain the fused features, and performing image reconstruction with a decoder in the fusion network from the fused features to obtain the fused image.
2. The spatial registration and local-global multi-scale multi-modal medical image fusion method according to claim 1, wherein the training of the spatial registration network in step 1 is as follows:
first, the target image T and the floating image M are concatenated along the channel dimension and input into the deformation field network R for feature extraction, obtaining the deformation field ϕ_T2M based on the target image T and the deformation field ϕ_M2T based on the floating image M, formulated as:
u_T = R_θ(T, M)    (1)
u_M = R_θ(M, T)    (2)
ϕ_T2M = Id + u_T    (3)
ϕ_M2T = Id + u_M    (4)
where u_T and u_M denote the deformation fields learned with respect to the target image T and the floating image M, respectively; Id denotes the identity deformation field; θ denotes the parameters of the deformation field network; the registration network is optimized by minimizing the image similarity loss and by minimizing the regularization constraint on the deformation field;
then, a spatial transformer network applies a displacement transformation to each pixel of the target image T and of the floating image M according to the deformation fields ϕ_M2T and ϕ_T2M, respectively, yielding the registered images T∘ϕ_M2T and M∘ϕ_T2M; finally, the registered image T∘ϕ_M2T together with the floating image M, and the registered image M∘ϕ_T2M together with the target image T, are fed into the trained spatial evaluator to obtain the spatial errors e1 and e2; the spatial errors e1 and e2 are minimized by equation (5) while the parameters of the spatial registration network are optimized according to equation (6), which minimizes the spatial regularization constraint on the deformation field.
3. The spatial registration and local-global multi-scale multi-modal medical image fusion method according to claim 2, wherein the spatial evaluator is structured as follows: a 7×7 initial convolution block (with padding around the input image) is applied first, then two 3×3 convolutions perform shallow feature extraction, and 9 sequentially stacked residual blocks perform deeper feature extraction so that more spatial information is retained; finally, after two 3×3 upsampling stages, a convolution layer with a 7×7 kernel and a Tanh activation function convert the feature vector into a scalar that represents the spatial difference between the two images.
4. The spatial registration and local-global multi-scale multi-modal medical image fusion method according to claim 1, wherein the structure of the fusion network in step 3 is: an encoder-decoder structure is adopted; the encoder consists of a multi-scale subtraction CNN branch and an MPViT branch, which extract the local and global multi-scale features of the image, respectively;
the multi-scale subtraction CNN branch uses multi-scale subtraction units to compute the difference features between adjacent scales and fuses these difference features to obtain complementary features containing the difference information between different scales;
the MPViT branch adopts a multi-scale patch and multi-path structure with 4 stages in total, each stage consisting of a multi-scale patch module and a multi-path Transformer module.
5. The method for spatial registration and local global multi-scale multi-modal medical image fusion according to claim 4, wherein the multi-scale subtraction CNN branch is structured by adopting Res2Net-50 as a backbone network, extracting features of different scales, convolving each feature of different scales by 3 x 3, reducing the number of channels to 64 to reduce the number of parameters of subsequent operations, adding a plurality of multi-scale subtraction units in the horizontal and vertical directions, connecting the multi-scale subtraction units, and calculating a series of complementary features of different orders and trans-scalesFeatures of corresponding dimensions->And trans-scale supplemental features->Performing element-by-element addition to obtain complementary enhancement feature CE i The formula is as follows:
where i represents the corresponding scale and n represents the different orders.
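A short sketch of this complementary enhancement step under the notation above; the tensor shapes are illustrative only.

```python
import torch

def complementary_enhancement(F_i, MS_i):
    """CE_i = F_i + sum_n MS_i^n : element-wise addition of the scale-i backbone
    feature with all of its cross-scale complementary features."""
    return F_i + torch.stack(MS_i, dim=0).sum(dim=0)

# Example with dummy 64-channel feature maps at one scale:
# F_i  = torch.randn(1, 64, 56, 56)
# MS_i = [torch.randn(1, 64, 56, 56) for _ in range(3)]   # features of different orders n
# CE_i = complementary_enhancement(F_i, MS_i)
```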
6. The method for spatial registration and local global multi-scale multi-modal medical image fusion according to claim 5, wherein the multi-scale subtraction unit is structured as follows: convolution filters of fixed sizes at multiple scales extract features from the two feature maps F_A and F_B; the two features extracted by convolution filters of the same scale are subtracted element by element, obtaining, for each kernel size i, the detail and structure difference value between adjacent-scale features; finally, the obtained scale detail and structure difference values are aggregated to obtain the complementary feature MS of adjacent scales.
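A sketch of one multi-scale subtraction unit, assuming fixed (non-trainable) averaging filters at several kernel sizes and a trainable 3×3 convolution for the final aggregation; the kernel sizes and the aggregation layer are assumptions, not values from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSubtractionUnit(nn.Module):
    """Fixed filters of several kernel sizes are applied to both inputs;
    same-scale responses are subtracted element-wise and the absolute
    differences are aggregated into one complementary feature."""
    def __init__(self, channels=64, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.kernel_sizes = kernel_sizes
        self.channels = channels
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)   # aggregate differences

    def filt(self, x, k):
        # Fixed all-ones depthwise averaging filter of size k x k (assumption).
        w = torch.ones(self.channels, 1, k, k, device=x.device) / (k * k)
        return F.conv2d(x, w, padding=k // 2, groups=self.channels)

    def forward(self, fa, fb):
        diffs = [torch.abs(self.filt(fa, k) - self.filt(fb, k)) for k in self.kernel_sizes]
        return self.fuse(sum(diffs))

# msu = MultiScaleSubtractionUnit()
# ms = msu(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
```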
7. The method for spatial registration and local global multi-scale multi-modal medical image fusion according to claim 4, wherein the MPViT branch has 4 stages; in each stage, the same feature is first embedded into patches of different scales simultaneously, the patch sequences of different scales are then fed independently into Transformer encoders through multiple paths for global feature extraction, and the global features obtained from the patch sequences of different scales are aggregated, so that feature representations of different scales are realized at the same feature level;
the structure of the multi-scale patch module is as follows: in the first stage, given the output feature X_i from the Conv-stem, a convolution F_{k×k}(·) with kernel size k maps the input X_i to a new token feature F_{k×k}(X_i); the spatial size of the output feature follows the formula:

H_out = ⌊(H − k + 2p)/s + 1⌋,  W_out = ⌊(W − k + 2p)/s + 1⌋

wherein H and W are the height and width of the input feature, k is the size of the convolution kernel, s is the stride, and p is the padding;
given the output feature of the previous stage, features with different patch scales but the same feature-map size are obtained through the multi-scale patch module, and are then reshaped into token sequences of length H_i × W_i with C_i channels, which are respectively passed to the multi-path Transformer module for global feature extraction and aggregation;
the multi-path Transformer module assigns a corresponding Transformer branch to each patch scale and additionally adds a local convolution branch in order to better capture the structural information and local relations of each patch; after feature extraction, the global features of the Transformer branches are reshaped back into feature-map form by two-dimensional reshaping and concatenated with the local features of the convolution branch to obtain the aggregated feature A_i, and a 1×1 convolution outputs the final feature X_{i+1} as the input of the next stage;
wherein each Transformer branch consists of two normalization layers, a multi-head self-attention layer, a feed-forward network layer and an average pooling layer, and the local convolution branch consists of two 1×1 convolutions, a 3×3 depthwise convolution and a residual connection.
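The sketch below covers one MPViT-style stage as described in this claim: a multi-scale patch embedding followed by a multi-path Transformer block with a local convolution branch. The kernel sizes, stride, path count and head count are illustrative; nn.TransformerEncoderLayer stands in for the claimed two-norm/MHSA/FFN branch (the average pooling layer is omitted), and running the local branch on the first path's embedding is a simplification.

```python
import torch
import torch.nn as nn

class MultiScalePatchEmbed(nn.Module):
    """Embed the same feature map with several kernel sizes but a shared stride,
    so every path sees patches of a different scale while all paths produce
    token sequences of identical length."""
    def __init__(self, in_ch, embed_dim, kernel_sizes=(3, 5, 7), stride=2):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv2d(in_ch, embed_dim, k, stride=stride, padding=k // 2)
            for k in kernel_sizes])

    def forward(self, x):
        feats = [conv(x) for conv in self.paths]                 # same H_out x W_out per path
        h, w = feats[0].shape[-2:]
        tokens = [f.flatten(2).transpose(1, 2) for f in feats]   # [B, H_out*W_out, C]
        return tokens, (h, w)

class LocalConvBranch(nn.Module):
    """Two 1x1 convolutions around a 3x3 depthwise convolution, plus a residual."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1))

    def forward(self, x):
        return x + self.body(x)

class MultiPathTransformerBlock(nn.Module):
    """One Transformer encoder layer per patch scale plus a local convolution
    branch; global tokens are reshaped back to 2D, concatenated with the local
    features, and fused by a 1x1 convolution into the next-stage feature."""
    def __init__(self, dim, num_paths=3, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=dim * 4,
                                       batch_first=True, norm_first=True)
            for _ in range(num_paths)])
        self.local = LocalConvBranch(dim)
        self.fuse = nn.Conv2d(dim * (num_paths + 1), dim, 1)

    def forward(self, tokens, hw):
        h, w = hw
        b, _, c = tokens[0].shape
        # Local branch runs on the first path's embedding reshaped to 2D (simplification).
        x_2d = tokens[0].transpose(1, 2).reshape(b, c, h, w)
        feats = [self.local(x_2d)]
        for seq, blk in zip(tokens, self.blocks):
            g = blk(seq)                                          # global features per path
            feats.append(g.transpose(1, 2).reshape(b, c, h, w))
        return self.fuse(torch.cat(feats, dim=1))                 # aggregated feature -> next stage

# One stage on a 64-channel feature map:
# embed = MultiScalePatchEmbed(64, 96)
# block = MultiPathTransformerBlock(96)
# tokens, hw = embed(torch.randn(1, 64, 56, 56))
# x_next = block(tokens, hw)                                      # [1, 96, 28, 28]
```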
8. The method of spatial registration and local global multi-scale multi-modal medical image fusion according to claim 1, characterized in that the fusion network is trained with a loss function, which is the sum of mean square error loss, structural similarity loss and total variation loss;
The mean square error loss squares the difference between corresponding pixel values of the reconstructed image and the source image and averages over all pixels, measuring the difference between the generated image and the target image and thereby ensuring pixel-level reconstruction; the formula is as follows:

L_mse = (1/(H·W)) Σ_{i,j} (I_out(i,j) − I_in(i,j))²
wherein I_out is the output reconstructed image, I_in is the input source image, i and j are respectively the abscissa and the ordinate of a pixel point, and H and W are the height and width of the image;
the structural similarity loss trains the network by comparing the structural similarity between the reconstructed image and the source image, as follows:
L_ssim = 1 − SSIM(I_out, I_in) (11)
wherein the SSIM function represents a structural similarity between the reconstructed image and the source image;
the total variation loss is used to penalize abrupt spatial variations in the image, and the formula is as follows:
R(i,j) = I_out(i,j) − I_in(i,j) (12)
L_TV = Σ_{i,j} (‖R(i,j+1) − R(i,j)‖₂ + ‖R(i+1,j) − R(i,j)‖₂) (13)
wherein R(i,j) is the difference between the original image and the reconstructed image, and ‖·‖₂ is the L2 norm;
the loss function formula is as follows:
L = L_mse + λ₁·L_ssim + λ₂·L_TV (14)
wherein λ₁ and λ₂ are respectively the weight coefficients of the structural similarity loss and the total variation loss in the total loss.
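A compact sketch of this training loss, combining the MSE, SSIM and TV terms of equations (11)-(14); it assumes the third-party pytorch_msssim package for SSIM, and the λ values shown are placeholders rather than the patent's settings.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # third-party SSIM implementation (assumption)

def fusion_loss(i_out, i_in, lambda1=1.0, lambda2=0.1):
    """L = L_mse + lambda1 * (1 - SSIM) + lambda2 * TV of the residual R = I_out - I_in."""
    l_mse = F.mse_loss(i_out, i_in)
    l_ssim = 1.0 - ssim(i_out, i_in, data_range=1.0)
    r = i_out - i_in
    # Per-pixel L2 norm (over channels) of horizontal and vertical differences of R;
    # for single-channel images this reduces to summed absolute differences.
    l_tv = (torch.norm(r[:, :, :, 1:] - r[:, :, :, :-1], p=2, dim=1).sum()
            + torch.norm(r[:, :, 1:, :] - r[:, :, :-1, :], p=2, dim=1).sum())
    return l_mse + lambda1 * l_ssim + lambda2 * l_tv

# loss = fusion_loss(reconstruction, source)   # tensors of shape [B, C, H, W] in [0, 1]
```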
9. The method of spatial registration and local global multi-scale multi-modal medical image fusion according to claim 1, wherein the adaptive fusion strategy is:
firstly, GL_K^M(i,j) (K ∈ {A, B}) is used to denote the energy value of the feature map at position (i,j), where K indexes the feature map of a specific source image and M is the number of channels of the feature map; the strategy applied at each position is determined by setting a threshold T, with the following specific rules:
1) if GL_A^M(i,j) < T and GL_B^M(i,j) < T, i.e. the energies of the two images at the corresponding position are both smaller than the threshold T, the maximum strategy is adopted for fusion and the fused feature F(i,j) is obtained according to the following formula:

F(i,j) = max(F_A(i,j), F_B(i,j))
2) if GL_A^M(i,j) > T and GL_B^M(i,j) > T, i.e. the energies of the two images at the corresponding position are both larger than the threshold T, the L1-norm strategy is adopted for fusion and the fused feature F(i,j) is obtained according to the following formula:

F(i,j) = W_A(i,j)·F_A(i,j) + W_B(i,j)·F_B(i,j), with W_K(i,j) = C_K(i,j) / (C_A(i,j) + C_B(i,j)) and C_K(i,j) = ‖F_K(i,j)‖₁
wherein C_K(i,j) represents the initial activity level, ‖·‖₁ represents the L1 norm, and W_A and W_B respectively represent the weights at the corresponding positions of the two feature maps.
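One possible reading of this adaptive strategy in code: the energy measure GL is approximated here by the channel-wise L1 activity (an assumption, not the patent's definition), positions where the two energies straddle the threshold fall back to the L1-weighted rule purely for illustration, and the threshold value is a placeholder.

```python
import torch

def adaptive_fuse(fa, fb, threshold):
    """Threshold-based rule: per position, element-wise maximum where both
    energies fall below T, L1-norm-weighted averaging otherwise.
    `fa`, `fb` are encoder feature maps of shape [B, M, H, W]."""
    ca = fa.abs().sum(dim=1, keepdim=True)        # initial activity level C_A(i,j)
    cb = fb.abs().sum(dim=1, keepdim=True)        # initial activity level C_B(i,j)
    wa = ca / (ca + cb + 1e-8)                    # W_A(i,j)
    wb = cb / (ca + cb + 1e-8)                    # W_B(i,j)

    fused_max = torch.maximum(fa, fb)             # maximum strategy
    fused_l1 = wa * fa + wb * fb                  # L1-norm-weighted strategy

    low = (ca < threshold) & (cb < threshold)     # both energies below T
    return torch.where(low, fused_max, fused_l1)

# F = adaptive_fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32), threshold=8.0)
```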
CN202311513745.3A 2023-11-14 2023-11-14 Spatial registration and local global multi-scale multi-modal medical image fusion method Pending CN117333750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311513745.3A CN117333750A (en) 2023-11-14 2023-11-14 Spatial registration and local global multi-scale multi-modal medical image fusion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311513745.3A CN117333750A (en) 2023-11-14 2023-11-14 Spatial registration and local global multi-scale multi-modal medical image fusion method

Publications (1)

Publication Number Publication Date
CN117333750A true CN117333750A (en) 2024-01-02

Family

ID=89275814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311513745.3A Pending CN117333750A (en) 2023-11-14 2023-11-14 Spatial registration and local global multi-scale multi-modal medical image fusion method

Country Status (1)

Country Link
CN (1) CN117333750A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974475A (en) * 2024-04-02 2024-05-03 华中科技大学同济医学院附属同济医院 Focus image fusion method and system under four-dimensional ultrasonic endoscope observation


Similar Documents

Publication Publication Date Title
CN107610194B (en) Magnetic resonance image super-resolution reconstruction method based on multi-scale fusion CNN
CN108734659B (en) Sub-pixel convolution image super-resolution reconstruction method based on multi-scale label
CN109214989B (en) Single image super resolution ratio reconstruction method based on Orientation Features prediction priori
CN107194912B (en) Brain CT/MR image fusion method based on sparse representation and improved coupled dictionary learning
CN112116601B (en) Compressed sensing sampling reconstruction method and system based on generation of countermeasure residual error network
CN111870245B (en) Cross-contrast-guided ultra-fast nuclear magnetic resonance imaging deep learning method
CN113012172A (en) AS-UNet-based medical image segmentation method and system
CN110136122B (en) Brain MR image segmentation method based on attention depth feature reconstruction
CN111325750B (en) Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
CN112132878B (en) End-to-end brain nuclear magnetic resonance image registration method based on convolutional neural network
CN115457020B (en) 2D medical image registration method fusing residual image information
CN117333750A (en) Spatial registration and local global multi-scale multi-modal medical image fusion method
CN115375711A (en) Image segmentation method of global context attention network based on multi-scale fusion
CN110853048A (en) MRI image segmentation method, device and storage medium based on rough training and fine training
CN115511767A (en) Self-supervised learning multi-modal image fusion method and application thereof
CN114331849B (en) Cross-mode nuclear magnetic resonance hyper-resolution network and image super-resolution method
CN115578427A (en) Unsupervised single-mode medical image registration method based on deep learning
CN115578262A (en) Polarization image super-resolution reconstruction method based on AFAN model
CN116843679B (en) PET image partial volume correction method based on depth image prior frame
Sander et al. Autoencoding low-resolution MRI for semantically smooth interpolation of anisotropic MRI
CN117593199A (en) Double-flow remote sensing image fusion method based on Gaussian prior distribution self-attention
Yang et al. MGDUN: An interpretable network for multi-contrast MRI image super-resolution reconstruction
CN116993639A (en) Visible light and infrared image fusion method based on structural re-parameterization
CN110866888A (en) Multi-modal MRI (magnetic resonance imaging) synthesis method based on potential information representation GAN (generic antigen)
CN116309754A (en) Brain medical image registration method and system based on local-global information collaboration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination