CN114372986A - Attention-guided multi-modal feature fusion image semantic segmentation method and device
- Publication number: CN114372986A (application number CN202111658857.9A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06T7/10: Image analysis; Segmentation; Edge detection
- G06F18/253: Pattern recognition; Fusion techniques of extracted features
- G06T2207/10004: Image acquisition modality; Still image; Photographic image
Abstract
According to the attention-guided multi-modal feature fusion image semantic segmentation method and device disclosed by the embodiments of the invention, the extracted color image features and depth image features are first mixed; the mixed features are refined and added back into the input features along both the channel and spatial dimensions, eliminating the noise of the depth map and adaptively aligning the two sets of features. To further fuse the two sets of features complementarily, the complementary relationship between the color image and the depth image is adaptively learned by acquiring the importance degree of each corresponding position of the two sets of features, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method is adopted to introduce the fusion features of the encoding stage, adding more detail information so that more information focuses on the boundary regions during segmentation; fine segmentation of the boundary regions is realized, and a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an image semantic segmentation method and device for attention-guided multi-modal feature fusion.
Background
Semantic segmentation aims to accurately classify each pixel in an image; it is a pixel-level classification method widely applied in many fields such as vision-based autonomous driving, human-computer interaction, medical image segmentation, and three-dimensional map reconstruction. Accurate pixel classification effectively captures the scene information in an image: from the segmentation result, the specific position of each target in the image can be obtained, and the category and state of each target can be further determined. Enabling a computer to automatically understand a scene from acquired image information is one of the most challenging tasks in computer vision. In recent years, depth cameras, such as Intel's RealSense and Microsoft's Kinect, have been widely used to improve semantic segmentation performance. Compared with the color image alone, the depth information provides not only semantic information but also the size and geometric information of objects in the actual scene, further improving semantic segmentation performance.
For RGB-D semantic segmentation, many current methods improve performance mainly by fusing RGB image features and depth image features to generate features with stronger representation capability, and generally adopt an encoder-decoder structure; according to the fusion stage, these structures can be divided into early fusion, middle fusion and late fusion. Most of the fusion modules adopted by these methods directly fuse the depth map features with the color map features; the depth information is not fully utilized, and complementary fusion of the color map features and the depth map features is not achieved. Meanwhile, because the imaging of depth cameras such as the RealSense is affected by factors such as illumination, smooth surfaces and hardware interference, the depth image suffers from blurred boundaries and large hole areas; methods that directly fuse the features of the two modalities cannot eliminate the noise present in the depth information and introduce interference features into the network model, which ultimately reduces segmentation precision and results in poor robustness.
Disclosure of Invention
The embodiment of the invention mainly aims to provide an attention-guided multi-modal feature fusion image semantic segmentation method and device, which can at least solve the problems of poor robustness, low segmentation precision and the like of an RGB-D image semantic segmentation model provided in the related technology.
In order to achieve the above object, a first aspect of embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation method, which is applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layer feature fusion decoding network, and the method includes:
respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
connecting the color image features and the depth image features along a channel dimension through the multi-modal feature alignment network, performing convolution operation to obtain mixed features, aligning the color image features and the depth image features on the channel dimension and the space dimension based on the mixed features, and obtaining the aligned color image features and the aligned depth image features;
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic through the cross-modal characteristic fusion network, and then fusing the first weight matrix with the aligned color image characteristic and fusing the second weight matrix with the aligned depth image characteristic and then performing superposition processing to obtain a fusion characteristic;
and performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network, and outputting a semantic segmentation map.
In order to achieve the above object, a second aspect of the embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation apparatus, applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layered feature fusion decoding network, the apparatus including:
the extraction module is used for respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
the alignment module is used for performing convolution operation to obtain mixed features after the color image features and the depth image features are connected along channel dimensions through the multi-modal feature alignment network, and aligning the color image features and the depth image features on the channel dimensions and the space dimensions based on the mixed features to obtain the aligned color image features and the aligned depth image features;
the fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network, then fusing the first weight matrix and the aligned color image feature, and fusing the second weight matrix and the aligned depth image feature and then performing superposition processing to obtain a fusion feature;
and the decoding module is used for performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network and outputting a semantic segmentation map.
According to the attention-guided multi-modal feature fusion image semantic segmentation method and device provided by the embodiments of the invention, the extracted color image features and depth image features are mixed; the mixed features are refined and added back into the input features along both the channel and spatial dimensions, eliminating the noise present in the depth map and adaptively aligning the two sets of features. To further fuse the two sets of features complementarily, the complementary relationship between the color image and the depth image is adaptively learned by acquiring the importance degree of each corresponding position of the two sets of features, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method is adopted to introduce the fusion features of the encoding stage, adding more detail information so that more information focuses on the boundary regions during segmentation; fine segmentation of the boundary regions is realized, and a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Other features and corresponding effects of the present invention are set forth in later portions of the specification; it should be understood that at least some of these effects will become apparent from the description.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating a basic flow of a semantic segmentation method for an image according to a first embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-modal feature alignment network based on attention guidance according to a first embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a cross-modal feature fusion network according to a first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a multi-layer feature fusion decoding network according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating program modules of an image semantic segmentation apparatus according to a second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First embodiment:
In order to solve the technical problems of poor robustness and low segmentation accuracy of the RGB-D image semantic segmentation model provided in the related art, this embodiment provides an image semantic segmentation method applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network. FIG. 1 is a basic flow diagram of the image semantic segmentation method provided in this embodiment; the method includes the following steps:
Step 101: respectively performing feature extraction processing on a color image and a corresponding depth image through the feature extraction network to obtain color image features and depth image features.
Specifically, a Kinect camera, for example, is configured with a color camera for capturing color images and an infrared camera for simultaneously capturing depth images; the depth images can provide additional geometric and spatial information.
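The patent does not specify the backbone of the feature extraction network. The following is a minimal sketch, assuming two parallel ResNet-18 encoders (one per modality) that expose the per-layer features later consumed by the alignment network; the class and method names are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwinEncoder(nn.Module):
    """Two parallel backbones extracting per-layer features from the
    color image and the depth image (depth replicated to 3 channels)."""
    def __init__(self):
        super().__init__()
        self.rgb_backbone = models.resnet18(weights=None)
        self.depth_backbone = models.resnet18(weights=None)

    @staticmethod
    def _stages(b, x):
        x = b.maxpool(b.relu(b.bn1(b.conv1(x))))
        feats = []
        for layer in (b.layer1, b.layer2, b.layer3, b.layer4):
            x = layer(x)
            feats.append(x)  # one feature map per encoder stage
        return feats

    def forward(self, rgb, depth):
        # depth: (B, 1, H, W) -> replicate to 3 channels for the backbone
        return (self._stages(self.rgb_backbone, rgb),
                self._stages(self.depth_backbone, depth.repeat(1, 3, 1, 1)))
```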
Step 102: connecting the color image features and the depth image features along the channel dimension through the multi-modal feature alignment network, performing a convolution operation to obtain mixed features, and aligning the color image features and the depth image features in the channel dimension and the spatial dimension based on the mixed features to obtain the aligned color image features and the aligned depth image features.
Specifically, noise information inevitably exists in the depth map due to the acquisition process, and conventional fusion methods do not account for the noise present in the depth features; the multi-modal feature alignment network provided in this embodiment can effectively eliminate the noise features and realize alignment of the two sets of features.
FIG. 2 is a schematic structural diagram of the attention-guided multi-modal feature alignment network provided in this embodiment. First, the features obtained from the feature extraction network are connected together along the channel dimension, a convolution is adopted to reduce the number of channels of the connected features, and the two sets of features are adaptively fused. The specific implementation can be expressed as:
F_rgbd = F_fc(X ‖ D)

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features extracted at each layer of the feature extraction network; ‖ represents connecting the two features together along the channel dimension to obtain a feature of size R^(2C×H×W); F_fc represents a convolution operation that reduces the number of channels of the concatenated feature (R^(2C×H×W) → R^(C×H×W)); and F_rgbd ∈ R^(C×H×W) represents the mixed feature.
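As a minimal sketch of this mixing step, assuming F_fc is realized as a 1×1 convolution followed by batch normalization and ReLU (the patent only states that a convolution reduces the channel count), the module below connects X and D along the channel dimension and halves the channels:

```python
import torch
import torch.nn as nn

class FeatureMixer(nn.Module):
    """Connects X and D along the channel dimension and applies F_fc,
    reducing R^(2C x H x W) back to R^(C x H x W)."""
    def __init__(self, channels):
        super().__init__()
        self.f_fc = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, d):
        return self.f_fc(torch.cat([x, d], dim=1))  # F_rgbd
```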
Secondly, performing global average pooling on the mixed features in the horizontal direction and the vertical direction respectively to obtain a feature vector in the horizontal direction and a feature vector in the vertical direction respectively; connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and coded intermediate feature map; and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Specifically, in this embodiment, global average pooling is performed on the mixed features in the horizontal and vertical directions, the mixed features are converted into feature vectors in the two directions, compression and excitation are performed on the channel dimension, feature alignment on the channel dimension is realized, and spatial position information is retained in the feature alignment process. The specific implementation algorithm of this process can be expressed as the following formula:
f = δ(F_h ‖ F_w)

wherein the mixed feature F_rgbd ∈ R^(C×H×W) is subjected to global average pooling in the horizontal and vertical directions to obtain the features F_h ∈ R^(C×H×1) and F_w ∈ R^(C×1×W); ‖ represents connecting the two features together along the channel dimension; δ represents the compressing nonlinear activation function; f represents the intermediate feature map obtained by compressing and encoding the spatial information in the horizontal and vertical directions; and r is a coefficient controlling the channel reduction ratio.
In an optional implementation of this embodiment, a preset alignment repair algorithm is further adopted to perform feature alignment on the compressed and encoded intermediate feature map, obtaining the aligned color map features and the aligned depth map features. That is, the weight matrices are multiplied element by element with the original features to obtain a feature representation with less noise; the weighting thus enhances the feature response of the output at the corresponding positions.
The alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map f is divided along the channel dimension; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features. Through the convolution operation F_h^X, f_h is restored to a feature with channel number C, adaptively acquiring the weight matrix of the color image features along the horizontal direction; through F_w^X, f_w is restored to a feature with channel number C, adaptively acquiring the weight matrix of the color image features along the vertical direction; through F_h^D, f_h is restored to a feature with channel number C, adaptively acquiring the weight matrix of the depth map features along the horizontal direction; and through F_w^D, f_w is restored to a feature with channel number C, adaptively acquiring the weight matrix of the depth map features along the vertical direction.
In this way, the network can exploit the most useful visual appearance and geometry information, effectively suppressing noise features in the depth stream.
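The sketch below assembles the whole alignment step under stated assumptions: the directional pooling, compression and split follow the closely related coordinate-attention layout (the pooled vectors are reshaped so they can be concatenated and split, whereas the patent describes the connection and division along the channel dimension), and the weight matrices are applied by element-wise multiplication as in the reconstruction above. The reduction ratio r and the 1×1 kernels are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class MultiModalAlign(nn.Module):
    """Directional pooling of F_rgbd, compression by ratio r, split into
    f_h / f_w, and per-modality sigmoid weight matrices applied to X, D."""
    def __init__(self, c, r=16):
        super().__init__()
        mid = max(c // r, 8)
        self.compress = nn.Sequential(
            nn.Conv2d(c, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # four convolutions restoring channel number C (F_h^X, F_w^X, F_h^D, F_w^D)
        self.xh, self.xw = nn.Conv2d(mid, c, 1), nn.Conv2d(mid, c, 1)
        self.dh, self.dw = nn.Conv2d(mid, c, 1), nn.Conv2d(mid, c, 1)

    def forward(self, x, d, f_rgbd):
        _, _, h, w = f_rgbd.shape
        f_h = f_rgbd.mean(dim=3, keepdim=True)                      # (B, C, H, 1)
        f_w = f_rgbd.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1)
        f = self.compress(torch.cat([f_h, f_w], dim=2))             # (B, C/r, H+W, 1)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)                               # (B, C/r, 1, W)
        x_rep = x * torch.sigmoid(self.xh(f_h)) * torch.sigmoid(self.xw(f_w))
        d_rep = d * torch.sigmoid(self.dh(f_h)) * torch.sigmoid(self.dw(f_w))
        return x_rep, d_rep
```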
Step 103: respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features through the cross-modal feature fusion network, then fusing the first weight matrix with the aligned color image features and the second weight matrix with the aligned depth image features, and performing superposition processing to obtain fusion features.
FIG. 3 is a schematic structural diagram of the cross-modal feature fusion network provided in this embodiment. The aligned depth features and color features are each compressed into one channel, and convolution is used to learn the difference distributions of the color features and the depth features at different position points. To further obtain color features and depth features with complementarity, the feature map is divided into two feature maps along the channel direction, the importance degree of each position point of the two feature maps is obtained complementarily using the softmax function, and the weights are respectively fused with the corresponding aligned features and then superposed, so that the color map features and the depth map features adaptively generate a high-quality feature map.
Specifically, in this embodiment, a preset difference distribution learning algorithm may be adopted through a cross-modal feature fusion network to respectively obtain a first weight matrix of importance degrees of each position point of aligned color image features and a second weight matrix of importance degrees of each position point of aligned depth image features;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations that compress the input features X_rep and D_rep, each with channel number C, to a single channel (R^(C×H×W) → R^(1×H×W)); ‖ represents connection along the channel dimension; δ represents the softmax activation function, through which the relative importance degree of each position point of the depth map features and the color map features is obtained complementarily, realizing complementary fusion of the two features; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
Further, the fusion feature is obtained based on a preset fusion algorithm, and the fusion algorithm can be expressed as:
X_fusion = G_X′ × X_rep + G_D′ × D_rep

wherein X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) represent the aligned features, and X_fusion ∈ R^(C×H×W) represents the fused output feature.
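A sketch of the cross-modal feature fusion network following the two formulas above; the 1×1 compression convolutions are an assumption, since the patent does not state the kernel size:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Compresses each aligned feature to one channel, connects the two
    maps and applies softmax across them, yielding the complementary
    per-position weights G_X' and G_D'; returns X_fusion."""
    def __init__(self, c):
        super().__init__()
        self.px = nn.Conv2d(c, 1, kernel_size=1)  # R^(C,H,W) -> R^(1,H,W)
        self.pd = nn.Conv2d(c, 1, kernel_size=1)

    def forward(self, x_rep, d_rep):
        g = torch.cat([self.px(x_rep), self.pd(d_rep)], dim=1)  # (B, 2, H, W)
        g = torch.softmax(g, dim=1)           # complementary weights per position
        gx, gd = g[:, 0:1], g[:, 1:2]         # G_X', G_D' in R^(1,H,W)
        return gx * x_rep + gd * d_rep        # X_fusion
```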
Step 104: performing convolution operations and upsampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network, and outputting a semantic segmentation map.
Specifically, in an encoder-decoder architecture, the multiple downsampling operations of the encoder stage may cause loss of detail information. If the decoding stage directly upsamples only the fusion features extracted by the backbone network at the final stage, the resulting segmentation is inaccurate in boundary regions and may even be wrong there. Based on this, FIG. 4 is a schematic structural diagram of the multi-layer feature fusion decoding network provided in this embodiment: a preset decoding algorithm is adopted, the fusion features are subjected to convolution operations and upsampling processing layer by layer, and a semantic segmentation map is output. The decoding algorithm is expressed as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents the fusion features obtained in the encoding stage; F_NBt1D represents three NBt1D layers, each NBt1D module comprising four convolutional layers and having fewer parameters and a smaller amount of computation than an ordinary convolution; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage of the decoding network; and G_fusion represents the semantic segmentation map output by each stage, which is restored to the size of the original image through upsampling.
In this way, the present embodiment adds the fusion features of the encoding stage in the decoding stage and merges them into the decoder layer by layer using the NBt1D module; since the NBt1D module is lightweight, a high-resolution, accurate semantic segmentation map can be constructed without introducing a large number of parameters.
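Assuming that NBt1D refers to the non-bottleneck-1D design known from ERFNet (four factorized 3×1/1×3 convolution layers with a residual connection, which matches the four-convolution-layer description above), one decoding stage might look like the following sketch; the bilinear upsampling and the shared channel width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NBt1D(nn.Module):
    """Four factorized convolution layers (3x1 and 1x3 pairs): the same
    receptive field as two 3x3 convolutions, with fewer parameters."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):
        return torch.relu(x + self.block(x))  # residual connection

class DecoderStage(nn.Module):
    """One stage of G_fusion = F_{3,2,1} + F_NBt1D(relu(F_c(G))),
    followed by 2x upsampling toward the original resolution."""
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Conv2d(c, c, 3, padding=1)                 # F_c
        self.nbt = nn.Sequential(NBt1D(c), NBt1D(c), NBt1D(c))  # three layers

    def forward(self, g, skip_fusion):
        g = skip_fusion + self.nbt(torch.relu(self.fc(g)))
        return nn.functional.interpolate(
            g, scale_factor=2, mode='bilinear', align_corners=False)
```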
According to the attention-guided multi-modal feature fusion image semantic segmentation method provided by this embodiment of the invention, features are first extracted from the RGB image and the depth image using the feature extraction network, and attention is used to control the feature refinement. In addition, to better fuse the two sets of features, the invention utilizes their complementarity to realize cross-complementary aggregation of features in the spatial dimension, finally achieving cross-modal feature fusion. Because the multiple downsampling operations of the encoding stage may cause loss of detail information, the upsampling of the decoding stage introduces the fusion features with stronger representation capability obtained in the encoding stage. Experiments on the RGB-D semantic segmentation datasets NYUv2 and SUN RGB-D show that the method outperforms most existing methods on the RGB-D semantic segmentation task, with good segmentation effect and a small number of parameters.
Second embodiment:
In order to solve the technical problems of poor robustness and low segmentation accuracy of the RGB-D image semantic segmentation model provided in the related art, this embodiment provides an image semantic segmentation apparatus applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network. Referring to FIG. 5, the image semantic segmentation apparatus of this embodiment includes:
an extraction module 501, configured to perform feature extraction processing on the color image and the corresponding depth image through a feature extraction network, respectively, to obtain a color image feature and a depth image feature;
an alignment module 502, configured to connect the color image features and the depth image features along the channel dimension through a multi-modal feature alignment network, perform convolution operation to obtain mixed features, and align the color image features and the depth image features in the channel dimension and the spatial dimension based on the mixed features to obtain aligned color image features and aligned depth image features;
the fusion module 503 is configured to obtain a first weight matrix of importance degrees of each position point of the aligned color image feature and a second weight matrix of importance degrees of each position point of the aligned depth image feature through a cross-modal feature fusion network, and then fuse the first weight matrix with the aligned color image feature and fuse the second weight matrix with the aligned depth image feature and then perform superposition processing to obtain a fusion feature;
and the decoding module 504 is configured to perform convolution operation and upsampling processing on the fusion features layer by layer through a multi-layer feature fusion decoding network, and output a semantic segmentation map.
In an optional implementation manner of this embodiment, the alignment module is specifically configured to: performing global average pooling on the mixed features in the horizontal direction and the vertical direction respectively to obtain a feature vector in the horizontal direction and a feature vector in the vertical direction respectively; connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and coded intermediate feature map; and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Further, in an optional implementation manner of this embodiment, the alignment module is specifically configured to: performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth map features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension; F_h^X, F_w^X, F_h^D and F_w^D represent convolution operations restoring f_h and f_w to features with channel number C; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features.
In an optional implementation manner of this embodiment, the fusion module is specifically configured to: respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic by adopting a preset difference distribution learning algorithm through a cross-modal characteristic fusion network;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations compressing each input to a single channel; ‖ represents connection along the channel dimension; δ represents the softmax activation function; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
In an optional implementation manner of this embodiment, the decoding module is specifically configured to: perform convolution operation and up-sampling processing on the fusion features layer by layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and output a semantic segmentation map;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents the fusion features of the encoding stage; F_NBt1D represents three NBt1D layers, each NBt1D module containing four convolutional layers; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage; and G_fusion represents the semantic segmentation map.
It should be noted that, the image semantic segmentation method in the foregoing embodiment can be implemented based on the image semantic segmentation device provided in this embodiment, and it can be clearly understood by those skilled in the art that, for convenience and simplicity of description, a specific working process of the image semantic segmentation device described in this embodiment may refer to a corresponding process in the foregoing method embodiment, and details are not described here.
The attention-guided multi-modal feature fusion image semantic segmentation device provided by this embodiment mixes the extracted color image features and depth image features; the mixed features are refined and added back into the input features along both the channel and spatial dimensions, eliminating the noise present in the depth map and adaptively aligning the two sets of features. To further fuse the two sets of features complementarily, the complementary relationship between the color image and the depth image is adaptively learned by acquiring the importance degree of each corresponding position of the two sets of features, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method is adopted to introduce the fusion features of the encoding stage, adding more detail information so that more information focuses on the boundary regions during segmentation; fine segmentation of the boundary regions is realized, and a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Third embodiment:
The present embodiment provides an electronic device. As shown in FIG. 6, the electronic device includes a processor 601, a memory 602 and a communication bus 603, wherein: the communication bus 603 is used for realizing connection and communication between the processor 601 and the memory 602; and the processor 601 is configured to execute one or more computer programs stored in the memory 602 to implement at least one step of the image semantic segmentation method in the first embodiment.
The present embodiments also provide a computer-readable storage medium including volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other Memory technology, CD-ROM (Compact disk Read-Only Memory), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
The computer-readable storage medium in this embodiment may be used for storing one or more computer programs, and the stored one or more computer programs may be executed by a processor to implement at least one step of the method in the first embodiment.
The present embodiment also provides a computer program, which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the method in the first embodiment; and in some cases at least one of the steps shown or described may be performed in an order different than that described in the embodiments above.
The present embodiments also provide a computer program product comprising a computer readable means on which a computer program as shown above is stored. The computer readable means in this embodiment may include a computer readable storage medium as shown above.
It will be apparent to those skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software (which may be implemented in computer program code executable by a computing device), firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
In addition, communication media typically embodies computer readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to one of ordinary skill in the art. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a more detailed description of embodiments of the present invention, and the present invention is not to be considered limited to such descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (10)
1. An attention-guided multi-modal feature fusion image semantic segmentation method is applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, and is characterized by comprising the following steps of:
respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
connecting the color image features and the depth image features along a channel dimension through the multi-modal feature alignment network, performing convolution operation to obtain mixed features, aligning the color image features and the depth image features on the channel dimension and the space dimension based on the mixed features, and obtaining the aligned color image features and the aligned depth image features;
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic through the cross-modal characteristic fusion network, and then fusing the first weight matrix with the aligned color image characteristic and fusing the second weight matrix with the aligned depth image characteristic and then performing superposition processing to obtain a fusion characteristic;
and performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network, and outputting a semantic segmentation map.
2. The image semantic segmentation method according to claim 1, wherein the step of aligning the color map features and the depth map features in a channel dimension and a space dimension based on the mixture features to obtain aligned color map features and aligned depth map features comprises:
respectively carrying out global average pooling on the mixed features in the horizontal direction and the vertical direction to respectively obtain a feature vector in the horizontal direction and a feature vector in the vertical direction;
connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along a channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map;
and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
3. The image semantic segmentation method according to claim 2, wherein the step of performing feature alignment based on the compressed and encoded intermediate feature map to obtain an aligned color map feature and an aligned depth map feature comprises:
performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension; F_h^X, F_w^X, F_h^D and F_w^D represent convolution operations restoring f_h and f_w to features with channel number C; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features.
4. The image semantic segmentation method according to claim 1, wherein the step of obtaining a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network respectively comprises:
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features by adopting a preset difference distribution learning algorithm through the cross-modal feature fusion network;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations compressing each input to a single channel; ‖ represents connection along the channel dimension; δ represents the softmax activation function; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
5. The image semantic segmentation method according to any one of claims 1 to 4, wherein the step of performing convolution operation and upsampling processing on the fused features layer by layer through the multi-layer feature fusion decoding network to output a semantic segmentation map comprises:
performing convolution operation and up-sampling processing on the fusion features layer by layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and outputting a semantic segmentation map;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents said fusion features of the encoding stage; F_NBt1D represents three NBt1D layers, each of the NBt1D modules containing four convolutional layers; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage; and G_fusion represents the semantic segmentation map.
6. An attention-guided multi-modal feature fusion image semantic segmentation device is applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, and is characterized by comprising:
the extraction module is used for respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
the alignment module is used for performing convolution operation to obtain mixed features after the color image features and the depth image features are connected along channel dimensions through the multi-modal feature alignment network, and aligning the color image features and the depth image features on the channel dimensions and the space dimensions based on the mixed features to obtain the aligned color image features and the aligned depth image features;
the fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network, then fusing the first weight matrix and the aligned color image feature, and fusing the second weight matrix and the aligned depth image feature and then performing superposition processing to obtain a fusion feature;
and the decoding module is used for performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network and outputting a semantic segmentation map.
7. The image semantic segmentation apparatus according to claim 6, wherein the alignment module is specifically configured to:
respectively carrying out global average pooling on the mixed features in the horizontal direction and the vertical direction to respectively obtain a feature vector in the horizontal direction and a feature vector in the vertical direction;
connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along a channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map;
and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
8. The image semantic segmentation apparatus of claim 7, wherein the alignment module is further configured to:
performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension; F_h^X, F_w^X, F_h^D and F_w^D represent convolution operations restoring f_h and f_w to features with channel number C; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features.
9. The image semantic segmentation apparatus according to claim 6, wherein the fusion module is specifically configured to:
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features by adopting a preset difference distribution learning algorithm through the cross-modal feature fusion network;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations compressing each input to a single channel; ‖ represents connection along the channel dimension; δ represents the softmax activation function; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
10. The image semantic segmentation apparatus according to any one of claims 6 to 9, wherein the decoding module is specifically configured to:
performing convolution operation and up-sampling processing on the fusion features layer by layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and outputting a semantic segmentation map;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents said fusion features of the encoding stage; F_NBt1D represents three NBt1D layers, each of the NBt1D modules containing four convolutional layers; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage; and G_fusion represents the semantic segmentation map.
Priority Applications (1)
- CN202111658857.9A, filed 2021-12-30: Image semantic segmentation method and device for attention-guided multi-modal feature fusion (granted as CN114372986B)
Publications (2)
- CN114372986A, published 2022-04-19
- CN114372986B, granted 2024-05-24
Family ID: 81141205 (one family application: CN202111658857.9A, status Active)
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant