CN114372986B - Image semantic segmentation method and device for attention-guided multi-modal feature fusion - Google Patents

Image semantic segmentation method and device for attention-guided multi-modal feature fusion

Info

Publication number
CN114372986B
CN114372986B · CN202111658857.9A
Authority
CN
China
Prior art keywords
feature
features
fusion
aligned
depth map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111658857.9A
Other languages
Chinese (zh)
Other versions
CN114372986A (en)
Inventor
钦闯
邹文斌
田时舜
李霞
邹辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huishi Innovation Shenzhen Co ltd
Shenzhen University
Original Assignee
Huishi Innovation Shenzhen Co ltd
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huishi Innovation Shenzhen Co ltd, Shenzhen University filed Critical Huishi Innovation Shenzhen Co ltd
Priority to CN202111658857.9A priority Critical patent/CN114372986B/en
Publication of CN114372986A publication Critical patent/CN114372986A/en
Application granted granted Critical
Publication of CN114372986B publication Critical patent/CN114372986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

According to the image semantic segmentation method and device for attention-guided multi-modal feature fusion disclosed by the embodiments of the invention, the extracted color map features and depth map features are mixed; the mixed features are refined in the channel and spatial dimensions and superposed back onto the input features, which eliminates depth map noise and adaptively aligns the two sets of features. To fuse the two sets of features complementarily, the importance of each corresponding position in the two features is obtained so that the complementary relation between the color map and the depth map is learned adaptively, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method brings in the fused features of the encoding stage, adding more detail information, focusing more information on boundary regions during segmentation and achieving fine segmentation of boundary regions, so that a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.

Description

Image semantic segmentation method and device for attention-guided multi-modal feature fusion
Technical Field
The invention relates to the technical field of image processing, in particular to an image semantic segmentation method and device for attention-guided multi-modal feature fusion.
Background
Semantic segmentation aims to accurately classify each pixel in an image. It is a pixel-level classification method widely applied in vision-based automatic driving, human-computer interaction, medical image segmentation, three-dimensional map reconstruction and other fields. Accurate pixel classification effectively acquires the scene information in an image: the segmentation result gives not only the specific position of each target in the image but also its category and state, and the acquired image information lets a computer understand the scene automatically, making this one of the most challenging tasks in computer vision. In recent years, depth cameras such as the Intel RealSense camera and the Microsoft Kinect camera have made it convenient to obtain depth information for an image, and depth information is now widely used to improve semantic segmentation performance. Compared with the color map, depth information provides not only semantic information but also actual object size and geometric information of the scene, which further improves segmentation performance.
For RGB-D semantic segmentation, most current methods fuse RGB image features and depth image features to generate features with stronger characterization ability, typically within an encoder-decoder structure; depending on the stage at which fusion occurs, the structure can be divided into early fusion, middle fusion and late fusion. Most of the fusion modules adopted by these methods fuse the depth map features and color map features directly and cannot make full use of the depth information, so complementary fusion of color map features and depth map features is not achieved. Meanwhile, because imaging of a RealSense depth camera is affected by illumination, smooth surfaces, hardware interference and other factors, the depth image suffers from blurred boundaries and large hole areas; a method that directly fuses the features of the two modalities cannot eliminate the noise present in the depth information, so interference features are introduced into the network model, which ultimately reduces segmentation accuracy and robustness.
Disclosure of Invention
The embodiments of the invention mainly aim to provide an attention-guided multi-modal feature fusion image semantic segmentation method and device, which can at least solve the problems of poor robustness and low segmentation precision of the RGB-D image semantic segmentation models provided in the related art.
To achieve the above object, a first aspect of the embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation method, which is applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layer feature fusion decoding network, and the method includes:
respectively carrying out feature extraction processing on the color map and the corresponding depth map through the feature extraction network to obtain color map features and depth map features;
After the color map features and the depth map features are connected along a channel dimension through the multi-modal feature alignment network, convolution operation is carried out to obtain mixed features, and alignment is carried out on the color map features and the depth map features in the channel dimension and the space dimension based on the mixed features to obtain aligned color map features and aligned depth map features;
Acquiring a first weight matrix of the importance degree of each position point of the aligned color map features and a second weight matrix of the importance degree of each position point of the aligned depth map features through the cross-modal feature fusion network, then fusing the first weight matrix with the aligned color map features, and then performing superposition processing after fusing the second weight matrix with the aligned depth map features to obtain fusion features;
And carrying out convolution operation and up-sampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network, and outputting a semantic segmentation graph.
To achieve the above object, a second aspect of the embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation apparatus, which is applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layer feature fusion decoding network, and includes:
The extraction module is used for respectively carrying out feature extraction processing on the color map and the corresponding depth map through the feature extraction network to obtain color map features and depth map features;
the alignment module is used for carrying out convolution operation to obtain a mixed feature after the color map feature and the depth map feature are connected along a channel dimension through the multi-modal feature alignment network, and carrying out alignment on the color map feature and the depth map feature in the channel dimension and the space dimension based on the mixed feature to obtain an aligned color map feature and an aligned depth map feature;
The fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color map features and a second weight matrix of the importance degree of each position point of the aligned depth map features through the cross-modal feature fusion network, then fusing the first weight matrix with the aligned color map features and then superposing the second weight matrix with the aligned depth map features to obtain fusion features;
and the decoding module is used for carrying out convolution operation and up-sampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network and outputting a semantic segmentation graph.
According to the attention-guided multi-modal feature fusion image semantic segmentation method and device provided by the embodiments of the invention, the extracted color map features and depth map features are mixed; the mixed features are refined in the channel and spatial dimensions and superposed back onto the input features, which eliminates the noise present in the depth map and adaptively aligns the two sets of features. To fuse the two sets of features complementarily, the importance of each corresponding position in the two features is obtained so that the complementary relation between the color map and the depth map is learned adaptively, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method brings in the fused features of the encoding stage, adding more detail information, focusing more information on boundary regions during segmentation and achieving fine segmentation of boundary regions, so that a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Additional features and corresponding effects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other drawings may be obtained from them without inventive effort for a person skilled in the art.
Fig. 1 is a basic flow diagram of an image semantic segmentation method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-modal feature alignment network based on attention guidance according to a first embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a cross-modal feature fusion network according to a first embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a multi-layer feature fusion decoding network according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram of a program module of an image semantic segmentation apparatus according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention will be clearly described in conjunction with the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First embodiment:
In order to solve the technical problems of poor robustness and low segmentation precision of the RGB-D image semantic segmentation models provided in the related art, this embodiment provides an image semantic segmentation method applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network. Fig. 1 is a basic flow diagram of the image semantic segmentation method provided in this embodiment, which includes the following steps:
Step 101: respectively carrying out feature extraction processing on the color map and the corresponding depth map through the feature extraction network to obtain color map features and depth map features.
Specifically, taking a Kinect camera as an example, the device is configured with a color camera and an infrared camera: the color camera captures the color map while the infrared camera simultaneously captures the corresponding depth map, and the depth map provides additional geometric and spatial information.
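As an illustrative aid (not part of the original disclosure), a minimal PyTorch-style sketch of such a dual-branch feature extraction network is given below; the ResNet-34 backbone, the single-channel depth stem and the four-stage grouping are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DualBranchEncoder(nn.Module):
    """Two parallel backbones: one for the color map, one for the depth map.
    The ResNet-34 backbone is an assumption; any CNN encoder could be used."""
    def __init__(self):
        super().__init__()
        rgb = models.resnet34(weights=None)
        dep = models.resnet34(weights=None)
        # The depth map is single-channel, so the first convolution of the depth branch is replaced.
        dep.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.rgb_stages = nn.ModuleList([
            nn.Sequential(rgb.conv1, rgb.bn1, rgb.relu, rgb.maxpool, rgb.layer1),
            rgb.layer2, rgb.layer3, rgb.layer4])
        self.dep_stages = nn.ModuleList([
            nn.Sequential(dep.conv1, dep.bn1, dep.relu, dep.maxpool, dep.layer1),
            dep.layer2, dep.layer3, dep.layer4])

    def forward(self, color, depth):
        feats = []  # per-stage pairs of (color map feature X, depth map feature D)
        x, d = color, depth
        for rgb_stage, dep_stage in zip(self.rgb_stages, self.dep_stages):
            x, d = rgb_stage(x), dep_stage(d)
            feats.append((x, d))
        return feats
```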
Step 102, after the color map features and the depth map features are connected along the channel dimension through a multi-modal feature alignment network, convolution operation is performed to obtain mixed features, and alignment is performed on the color map features and the depth map features in the channel dimension and the space dimension based on the mixed features to obtain aligned color map features and aligned depth map features.
Specifically, noise is inevitably introduced into the depth map during acquisition, and conventional fusion methods do not account for the noise present in the depth features. This embodiment therefore provides a multi-modal feature alignment network that can effectively eliminate noisy features and realize alignment of the two sets of features.
In this embodiment, the features acquired from the feature extraction network are first connected together along the channel dimension, and a convolution is adopted to reduce the number of channels of the connected features and to adaptively blend the two features; the specific implementation can be expressed as:
F_rgbd = F_fc(X || D)
where X ∈ R^(C×H×W) and D ∈ R^(C×H×W) denote the color map feature and depth map feature extracted from each layer of the feature extraction network respectively, || denotes connecting the two features together along the channel dimension to obtain a feature in R^(2C×H×W), F_fc denotes a convolution operation that reduces the channel number of the connected feature (R^(2C×H×W) → R^(C×H×W)), and F_rgbd ∈ R^(C×H×W) denotes the mixed feature.
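For illustration, a minimal PyTorch-style sketch of this mixing step follows; the 1×1 kernel and the BatchNorm/ReLU pairing inside F_fc are assumptions, since the text only specifies a channel-reducing convolution.

```python
import torch
import torch.nn as nn

class FeatureMixing(nn.Module):
    """Connect the color and depth features along the channel dimension and
    reduce 2C -> C channels with a convolution (F_fc in the text)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, x, d):                            # x, d: (B, C, H, W)
        return self.fc(torch.cat([x, d], dim=1))        # F_rgbd: (B, C, H, W)
```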
Then, carrying out global average pooling on the mixed features in the horizontal direction and the vertical direction respectively to obtain feature vectors in the horizontal direction and feature vectors in the vertical direction respectively; connecting the feature vector in the horizontal direction and the feature vector in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map; and carrying out feature alignment based on the compression-coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Specifically, this embodiment performs global average pooling on the mixed feature in the horizontal and vertical directions respectively, converting the feature into feature vectors along the two directions, then compresses and excites these feature vectors in the channel dimension, realizing feature alignment in the channel dimension while preserving spatial position information during alignment. The specific implementation of this process can be expressed by the following formula:
f = δ(F_h || F_w)
where the mixed feature F_rgbd ∈ R^(C×H×W) is subjected to global average pooling in the horizontal and vertical directions to obtain features F_h(h) ∈ R^(C×H×1) and F_w(w) ∈ R^(C×1×W), || denotes connecting the two features together along the channel dimension, δ denotes the nonlinear activation function used for compression, f is the intermediate feature map obtained by compression-encoding the spatial information in the horizontal and vertical directions, and r is the coefficient controlling the channel reduction ratio used in this compression.
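The directional pooling and compression step can be sketched as follows (illustrative only); the spatial concatenation axis, the permutation of the vertical descriptor and the minimum compressed channel width are assumptions modeled on coordinate-attention-style designs rather than details taken from the original text.

```python
import torch
import torch.nn as nn

class DirectionalEncoding(nn.Module):
    """Pool the mixed feature F_rgbd along the horizontal and vertical directions,
    join the two descriptors and compress them with reduction ratio r, producing
    the intermediate descriptors f_h and f_w used by the alignment repair step."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)                     # reduction ratio r (floor assumed)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # horizontal GAP -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # vertical GAP   -> (B, C, 1, W)
        self.compress = nn.Sequential(                  # delta in the text
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True))

    def forward(self, f_rgbd):                          # (B, C, H, W)
        b, c, h, w = f_rgbd.shape
        f_h = self.pool_h(f_rgbd)                       # (B, C, H, 1)
        f_w = self.pool_w(f_rgbd).permute(0, 1, 3, 2)   # (B, C, W, 1)
        f = self.compress(torch.cat([f_h, f_w], dim=2)) # (B, C/r, H+W, 1)
        f_h, f_w = torch.split(f, [h, w], dim=2)        # divide f back into two maps
        return f_h, f_w.permute(0, 1, 3, 2)             # (B, C/r, H, 1), (B, C/r, 1, W)
```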
In an optional implementation of this embodiment, a preset alignment repair algorithm is further used to perform feature alignment on the compression-encoded intermediate feature map to obtain the aligned color map features and the aligned depth map features. That is, the learned weights are multiplied with the original features channel by channel to obtain a less noisy feature representation, and the output feature response at the corresponding positions is finally enhanced by this weighting.
The alignment repair algorithm is defined in terms of the following quantities: X ∈ R^(C×H×W) and D ∈ R^(C×H×W) denote the color map feature and depth map feature obtained by the feature extraction network, σ denotes the sigmoid activation function, f^h and f^w denote the two independent feature maps obtained by dividing the compression-encoded intermediate feature map f along the channel dimension, and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) denote the finally output aligned color map features and aligned depth map features respectively. Four convolution operations produce the directional weight matrices: one restores f^h to a feature with channel number C and adaptively acquires the weight matrix of the color map features along the horizontal direction; one restores f^w to a feature with channel number C and adaptively acquires the weight matrix of the color map features along the vertical direction; one restores f^h to a feature with channel number C and adaptively acquires the weight matrix of the depth map features along the horizontal direction; and one restores f^w to a feature with channel number C and adaptively acquires the weight matrix of the depth map features along the vertical direction.
In this way, the network can exploit the most useful visual appearance and geometric information, and thus effectively suppress noisy features in the depth stream.
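A hedged sketch of the alignment repair step follows. It consumes the descriptors f_h and f_w from the previous sketch; the product of the two directional sigmoid maps and the residual addition back onto the inputs are assumptions inferred from the description and the abstract's "superposing into input features", not a verbatim reproduction of the patented formula.

```python
import torch
import torch.nn as nn

class AlignmentRepair(nn.Module):
    """Turn the directional descriptors f_h, f_w into per-modality weight matrices
    (sigmoid-activated convolutions restoring the channel number to C) and use them
    to re-weight and repair the original color and depth features."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)
        # four independent convolutions: {horizontal, vertical} x {color, depth}
        self.x_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.x_w = nn.Conv2d(mid, channels, kernel_size=1)
        self.d_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.d_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x, d, f_h, f_w):
        # weight matrices along the horizontal / vertical directions (broadcast to H x W)
        w_x = torch.sigmoid(self.x_h(f_h)) * torch.sigmoid(self.x_w(f_w))
        w_d = torch.sigmoid(self.d_h(f_h)) * torch.sigmoid(self.d_w(f_w))
        x_rep = x + x * w_x        # aligned color map features (residual re-weighting assumed)
        d_rep = d + d * w_d        # aligned, de-noised depth map features
        return x_rep, d_rep
```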
Step 103, respectively obtaining a first weight matrix of the importance degree of each position point of the aligned color map feature and a second weight matrix of the importance degree of each position point of the aligned depth map feature through a cross-modal feature fusion network, then fusing the first weight matrix with the aligned color map feature, and then performing superposition processing after fusing the second weight matrix with the aligned depth map feature to obtain a fusion feature.
Fig. 3 is a schematic structural diagram of the cross-modal feature fusion network provided in this embodiment. The aligned depth features and aligned color map features are each compressed to a single channel, and a convolution is adopted to learn the differential distribution of the color features and depth features at different positions. To further acquire complementary color and depth features, the resulting feature map is divided into two feature maps along the channel direction, the importance of each position point of the two feature maps is acquired complementarily with the softmax function, and the importance maps are fused with the corresponding aligned features and then superposed, so that the color map features and depth map features adaptively generate a high-quality feature map.
Specifically, in this embodiment, a preset differential distribution learning algorithm may be adopted by a cross-modal feature fusion network to respectively obtain a first weight matrix of importance degrees of all position points of the aligned color map features and a second weight matrix of importance degrees of all position points of the aligned depth map features;
the differential distribution learning algorithm is expressed as:
G_X′ || G_D′ = δ(F_X(X_rep) || F_D(D_rep))
where X_rep and D_rep denote the aligned color map features and aligned depth map features respectively, and F_X and F_D denote convolution operations: F_X compresses the input feature X_rep with channel number C to one channel (R^(C×H×W) → R^(1×H×W)), and F_D compresses the input feature D_rep with channel number C to one channel (R^(C×H×W) → R^(1×H×W)); || denotes connection along the channel dimension; δ denotes the softmax activation function, through which the relative importance of each position point of the depth map features and the color map features is obtained complementarily, realizing complementary fusion of the two features; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) denote the first weight matrix and the second weight matrix respectively.
Further, the fusion feature is obtained based on a preset fusion algorithm, and the fusion algorithm can be expressed as:
X_fusion = G_X′ × X_rep + G_D′ × D_rep
where X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) denote the aligned features and X_fusion ∈ R^(C×H×W) denotes the fused output feature.
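A minimal sketch of the cross-modal feature fusion network (illustrative only); the 3×3 kernel of the channel-compressing convolutions is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Compress each aligned feature to one channel, learn the differential
    distribution with a softmax over the two maps, and blend the aligned features
    with the complementary weights: X_fusion = G_X' * X_rep + G_D' * D_rep."""
    def __init__(self, channels):
        super().__init__()
        self.score_x = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.score_d = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x_rep, d_rep):                      # (B, C, H, W) each
        scores = torch.cat([self.score_x(x_rep),
                            self.score_d(d_rep)], dim=1)  # (B, 2, H, W)
        g = torch.softmax(scores, dim=1)                  # complementary importance per position
        g_x, g_d = g[:, 0:1], g[:, 1:2]                   # first / second weight matrices
        return g_x * x_rep + g_d * d_rep                  # fused feature X_fusion
```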
Step 104: carrying out convolution operation and up-sampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network, and outputting a semantic segmentation map.
In particular, in the encoder-decoder architecture, repeated downsampling in the encoder stage causes loss of detail information. If the decoding stage directly upsamples the fused features extracted by the backbone network in the final stage, the final segmentation result is inaccurate in boundary regions and may even be segmented incorrectly there. Based on this, Fig. 4 shows a schematic structural diagram of the multi-layer feature fusion decoding network provided in this embodiment. In this embodiment, the multi-layer feature fusion decoding network adopts a preset decoding algorithm that performs convolution operations and upsampling on the fused features layer by layer and outputs the semantic segmentation map; the decoding algorithm is expressed as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))
where F_{3,2,1} denotes the fused features obtained in the encoding stage, F_NBt1D denotes three layers of NBt1D modules, each NBt1D module containing four convolution layers; the NBt1D module uses decomposed (factorized) convolutions and therefore has fewer parameters and lower computational cost than an ordinary convolution. δ denotes the ReLU activation function, F_c denotes a 3×3 convolution layer, G denotes the input feature of each stage of the decoding network, and G_fusion denotes the semantic segmentation map output by each stage, which is restored to the original image size by upsampling.
Therefore, this embodiment adds the fused features of the encoding stage in the decoding stage, introducing them into the decoder layer by layer through the NBt1D modules; because the NBt1D module is lightweight, a high-resolution, accurate semantic segmentation map can be constructed without introducing a large number of parameters.
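An illustrative sketch of one stage of such a decoder built from NBt1D blocks; the factorized 3×1/1×3 layout, the dilation-free setting and the 2× bilinear upsampling are assumptions consistent with, but not dictated by, the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonBottleneck1D(nn.Module):
    """Factorized residual block with four convolution layers (3x1, 1x3, 3x1, 1x3),
    in the spirit of the NBt1D module referenced in the text."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)), nn.BatchNorm2d(channels))

    def forward(self, x):
        return F.relu(x + self.conv(x))

class DecoderStage(nn.Module):
    """One stage of the multi-layer feature fusion decoder: a 3x3 convolution with
    ReLU on the incoming decoder feature G, three NBt1D blocks, addition of the
    encoder-stage fused feature (the skip connection F_{3,2,1}), then 2x upsampling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[NonBottleneck1D(out_channels) for _ in range(3)])

    def forward(self, g, skip_fusion):              # skip_fusion: matching encoder fusion feature
        g = self.blocks(self.fc(g)) + skip_fusion   # G_fusion = F_{3,2,1} + F_NBt1D(delta(F_c(G)))
        return F.interpolate(g, scale_factor=2, mode='bilinear', align_corners=False)
```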
According to the attention-guided multi-modal feature fusion image semantic segmentation method provided by this embodiment of the invention, a feature extraction network first extracts features from the RGB image and the depth image, and the attention mechanism is used to refine the features. In addition, to better fuse the two sets of features, the invention exploits their complementarity to aggregate the features cross-complementarily in the spatial dimension, finally realizing cross-modal feature fusion. Because repeated downsampling in the encoding stage causes loss of detail information, the upsampling in the decoding stage introduces the fused features with stronger characterization ability obtained in the encoding stage. Experiments on the RGB-D semantic segmentation datasets NYUv2 and SUN RGB-D show that the method outperforms most existing methods on RGB-D semantic segmentation tasks, with good segmentation results and a small number of parameters.
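To tie the sketches above together, a hypothetical forward pass over a single encoder stage is shown below; the stage index, channel count and input resolution are illustrative assumptions, and the classes are the ones defined in the earlier sketches.

```python
import torch

# Assumes DualBranchEncoder, FeatureMixing, DirectionalEncoding,
# AlignmentRepair and CrossModalFusion from the sketches above.
encoder = DualBranchEncoder()
mix, enc = FeatureMixing(256), DirectionalEncoding(256)
repair, fuse = AlignmentRepair(256), CrossModalFusion(256)

color = torch.randn(1, 3, 480, 640)        # RGB image
depth = torch.randn(1, 1, 480, 640)        # depth map
x, d = encoder(color, depth)[2]            # third-stage features, 256 channels each
f_h, f_w = enc(mix(x, d))                  # multi-modal feature alignment: mixing + encoding
x_rep, d_rep = repair(x, d, f_h, f_w)      # aligned color / depth features
x_fusion = fuse(x_rep, d_rep)              # fused feature fed to the decoder
```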
Second embodiment:
In order to solve the technical problems of poor robustness and low segmentation precision of the RGB-D image semantic segmentation models provided in the related art, this embodiment provides an image semantic segmentation device applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network. Referring specifically to fig. 5, the image semantic segmentation device of this embodiment includes:
the extracting module 501 is configured to perform feature extraction processing on the color map and the corresponding depth map through a feature extraction network, so as to obtain a color map feature and a depth map feature;
The alignment module 502 is configured to connect the color map feature and the depth map feature along the channel dimension through the multi-modal feature alignment network, perform convolution operation to obtain a hybrid feature, and align the color map feature and the depth map feature in the channel dimension and the space dimension based on the hybrid feature to obtain an aligned color map feature and an aligned depth map feature;
the fusion module 503 is configured to obtain a first weight matrix of importance degrees of each position point of the aligned color map feature and a second weight matrix of importance degrees of each position point of the aligned depth map feature through a cross-modal feature fusion network, then fuse the first weight matrix with the aligned color map feature and fuse the second weight matrix with the aligned depth map feature, and then perform superposition processing to obtain a fusion feature;
and the decoding module 504 is used for carrying out convolution operation and up-sampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network, and outputting a semantic segmentation graph.
In an alternative implementation manner of this embodiment, the alignment module is specifically configured to: global average pooling is carried out on the mixed features in the horizontal direction and the vertical direction respectively, so that feature vectors in the horizontal direction and feature vectors in the vertical direction are obtained respectively; connecting the feature vector in the horizontal direction and the feature vector in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map; and carrying out feature alignment based on the compression-coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Further, in an optional implementation manner of this embodiment, the alignment module is specifically configured to: performing feature alignment on the compression-coded intermediate feature map by adopting a preset alignment restoration algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is defined in terms of the following quantities: X ∈ R^(C×H×W) and D ∈ R^(C×H×W) denote the color map feature and depth map feature obtained by the feature extraction network, σ denotes the sigmoid activation function, f^h and f^w denote the two independent feature maps obtained by dividing the compression-encoded intermediate feature map along the channel dimension, the directional weight matrices are obtained from f^h and f^w by convolution operations, and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) denote the finally output aligned color map features and aligned depth map features respectively.
In an optional implementation manner of this embodiment, the fusion module is specifically configured to: a preset differential distribution learning algorithm is adopted through a cross-modal feature fusion network, and a first weight matrix of the importance degree of each position point of the aligned color map features and a second weight matrix of the importance degree of each position point of the aligned depth map features are respectively obtained;
the differential distribution learning algorithm is expressed as:
G_X′ || G_D′ = δ(F_X(X_rep) || F_D(D_rep))
where X_rep and D_rep denote the aligned color map features and the aligned depth map features respectively, F_X and F_D denote convolution operations that compress X_rep and D_rep to one channel each, || denotes connection along the channel dimension, δ denotes the softmax activation function, and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) denote the first weight matrix and the second weight matrix respectively.
In an alternative implementation manner of this embodiment, the decoding module is specifically configured to: carrying out convolution operation and up-sampling processing on the fusion features layer by adopting a preset decoding algorithm through a multi-layer feature fusion decoding network, and outputting a semantic segmentation graph;
The decoding algorithm is expressed as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))
where F_{3,2,1} denotes the fused features of the encoding stage, F_NBt1D denotes three layers of NBt1D modules, each NBt1D module containing four convolution layers, δ denotes the ReLU activation function, F_c denotes a 3×3 convolution layer, G denotes the input feature of each stage, and G_fusion denotes the semantic segmentation map.
It should be noted that, the image semantic segmentation method in the foregoing embodiments may be implemented based on the image semantic segmentation device provided in the present embodiment, and those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process of the image semantic segmentation device described in the present embodiment may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
The attention-guided multi-modal feature fusion image semantic segmentation device provided by this embodiment mixes the extracted color map features and depth map features; the mixed features are refined in the channel and spatial dimensions and superposed back onto the input features, which eliminates the noise present in the depth map and adaptively aligns the two sets of features. To fuse the two sets of features complementarily, the importance of each corresponding position in the two features is obtained so that the complementary relation between the color map and the depth map is learned adaptively, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method brings in the fused features of the encoding stage, adding more detail information, focusing more information on boundary regions during segmentation and achieving fine segmentation of boundary regions, so that a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Third embodiment:
The present embodiment provides an electronic device, referring to fig. 6, which includes a processor 601, a memory 602, and a communication bus 603, wherein: a communication bus 603 for enabling connected communication between the processor 601 and the memory 602; the processor 601 is configured to execute one or more computer programs stored in the memory 602 to implement at least one step of the image semantic segmentation method in the first embodiment.
The present embodiments also provide a computer-readable storage medium including volatile or nonvolatile, removable or non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
The computer readable storage medium in this embodiment may be used to store one or more computer programs, where the stored one or more computer programs may be executed by a processor to implement at least one step of the method in the first embodiment.
The present embodiment also provides a computer program which can be distributed on a computer-readable medium and executed by a computing device to implement at least one step of the method of the above embodiment; in some cases, at least one of the steps shown or described may be performed in a different order than that described in the above embodiment.
The present embodiment also provides a computer program product comprising computer readable means having stored thereon a computer program as shown above. The computer readable means in this embodiment may comprise a computer readable storage medium as shown above.
It will be apparent to one skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the apparatus disclosed above may be implemented as software (which may be implemented in computer program code executable by a computing apparatus), firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media. Therefore, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a further detailed description of embodiments of the invention in connection with the specific embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (10)

1. An attention-guided multi-modal feature fusion image semantic segmentation method, applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, characterized by comprising the following steps:
respectively carrying out feature extraction processing on the color map and the corresponding depth map through the feature extraction network to obtain color map features and depth map features;
After the color map features and the depth map features are connected along a channel dimension through the multi-modal feature alignment network, convolution operation is carried out to obtain mixed features, and alignment is carried out on the color map features and the depth map features in the channel dimension and the space dimension based on the mixed features to obtain aligned color map features and aligned depth map features;
Acquiring a first weight matrix of the importance degree of each position point of the aligned color map features and a second weight matrix of the importance degree of each position point of the aligned depth map features through the cross-modal feature fusion network, then fusing the first weight matrix with the aligned color map features, and then performing superposition processing after fusing the second weight matrix with the aligned depth map features to obtain fusion features;
And carrying out convolution operation and up-sampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network, and outputting a semantic segmentation graph.
2. The image semantic segmentation method according to claim 1, wherein the step of aligning the color map features and the depth map features in a channel dimension and a space dimension based on the hybrid features to obtain aligned color map features and aligned depth map features comprises:
Global average pooling is carried out on the mixed features in the horizontal direction and the vertical direction respectively, so that feature vectors in the horizontal direction and feature vectors in the vertical direction are obtained respectively;
connecting the feature vector in the horizontal direction and the feature vector in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compression-coded intermediate feature map;
And carrying out feature alignment based on the compression-coded intermediate feature map to obtain aligned color map features and aligned depth map features.
3. The image semantic segmentation method according to claim 2, wherein the step of performing feature alignment based on the compression-encoded intermediate feature map to obtain aligned color map features and aligned depth map features comprises:
Performing feature alignment on the compression-coded intermediate feature map by adopting a preset alignment restoration algorithm to obtain aligned color map features and aligned depth map features;
The alignment repair algorithm is defined in terms of the following quantities: X ∈ R^(C×H×W) and D ∈ R^(C×H×W) denote the color map feature and depth map feature obtained by the feature extraction network, σ denotes a sigmoid activation function, f^h and f^w denote two independent feature maps obtained by dividing the compression-encoded intermediate feature map along the channel dimension, the directional weight matrices are obtained from f^h and f^w by convolution operations, and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) denote the finally output aligned color map features and aligned depth map features respectively.
4. The image semantic segmentation method according to claim 1, wherein the step of obtaining, through the cross-modal feature fusion network, a first weight matrix of importance levels of respective position points of the aligned color map features and a second weight matrix of importance levels of respective position points of the aligned depth map features, respectively, includes:
A preset differential distribution learning algorithm is adopted through the cross-modal feature fusion network, and a first weight matrix of the importance degree of each position point of the aligned color map features and a second weight matrix of the importance degree of each position point of the aligned depth map features are respectively obtained;
The differential distribution learning algorithm is expressed as:
G_X′ || G_D′ = δ(F_X(X_rep) || F_D(D_rep))
wherein X_rep and D_rep denote the aligned color map features and the aligned depth map features respectively, F_X and F_D denote convolution operations that compress X_rep and D_rep to one channel each, || denotes connection along the channel dimension, δ denotes a softmax activation function, and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) denote the first weight matrix and the second weight matrix respectively.
5. The image semantic segmentation method according to any one of claims 1 to 4, wherein the step of performing convolution operation and upsampling processing on the fused features layer by layer through the multi-layer feature fusion decoding network, and outputting a semantic segmentation map, includes:
The multi-layer feature fusion decoding network adopts a preset decoding algorithm, the fusion features are subjected to convolution operation and up-sampling processing layer by layer, and a semantic segmentation graph is output;
The decoding algorithm is expressed as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))
wherein F_{3,2,1} denotes the fused features of the encoding stages, F_NBt1D denotes three layers of NBt1D modules, each NBt1D module containing four convolution layers, δ denotes a ReLU activation function, F_c denotes a 3×3 convolution layer, G denotes the input feature of each stage, and G_fusion denotes the semantic segmentation map.
6. An attention-guided multi-modal feature fusion image semantic segmentation device applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, comprising:
The extraction module is used for respectively carrying out feature extraction processing on the color map and the corresponding depth map through the feature extraction network to obtain color map features and depth map features;
the alignment module is used for carrying out convolution operation to obtain a mixed feature after the color map feature and the depth map feature are connected along a channel dimension through the multi-modal feature alignment network, and carrying out alignment on the color map feature and the depth map feature in the channel dimension and the space dimension based on the mixed feature to obtain an aligned color map feature and an aligned depth map feature;
The fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color map features and a second weight matrix of the importance degree of each position point of the aligned depth map features through the cross-modal feature fusion network, then fusing the first weight matrix with the aligned color map features and then superposing the second weight matrix with the aligned depth map features to obtain fusion features;
and the decoding module is used for carrying out convolution operation and up-sampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network and outputting a semantic segmentation graph.
7. The image semantic segmentation apparatus according to claim 6, wherein the alignment module is specifically configured to:
Global average pooling is carried out on the mixed features in the horizontal direction and the vertical direction respectively, so that feature vectors in the horizontal direction and feature vectors in the vertical direction are obtained respectively;
connecting the feature vector in the horizontal direction and the feature vector in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compression-coded intermediate feature map;
And carrying out feature alignment based on the compression-coded intermediate feature map to obtain aligned color map features and aligned depth map features.
8. The image semantic segmentation apparatus as set forth in claim 7, wherein the alignment module is further configured to:
Performing feature alignment on the compression-coded intermediate feature map by adopting a preset alignment restoration algorithm to obtain aligned color map features and aligned depth map features;
The alignment repair algorithm is defined in terms of the following quantities: X ∈ R^(C×H×W) and D ∈ R^(C×H×W) denote the color map feature and depth map feature obtained by the feature extraction network, σ denotes a sigmoid activation function, f^h and f^w denote two independent feature maps obtained by dividing the compression-encoded intermediate feature map along the channel dimension, the directional weight matrices are obtained from f^h and f^w by convolution operations, and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) denote the finally output aligned color map features and aligned depth map features respectively.
9. The image semantic segmentation apparatus according to claim 6, wherein the fusion module is specifically configured to:
A preset differential distribution learning algorithm is adopted through the cross-modal feature fusion network, and a first weight matrix of the importance degree of each position point of the aligned color map features and a second weight matrix of the importance degree of each position point of the aligned depth map features are respectively obtained;
The differential distribution learning algorithm is expressed as:
G_X′ || G_D′ = δ(F_X(X_rep) || F_D(D_rep))
wherein X_rep and D_rep denote the aligned color map features and the aligned depth map features respectively, F_X and F_D denote convolution operations that compress X_rep and D_rep to one channel each, || denotes connection along the channel dimension, δ denotes a softmax activation function, and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) denote the first weight matrix and the second weight matrix respectively.
10. The image semantic segmentation apparatus according to any one of claims 6 to 9, wherein said decoding module is specifically configured to:
The multi-layer feature fusion decoding network adopts a preset decoding algorithm, the fusion features are subjected to convolution operation and up-sampling processing layer by layer, and a semantic segmentation graph is output;
The decoding algorithm is expressed as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))
wherein F_{3,2,1} denotes the fused features of the encoding stages, F_NBt1D denotes three layers of NBt1D modules, each NBt1D module containing four convolution layers, δ denotes a ReLU activation function, F_c denotes a 3×3 convolution layer, G denotes the input feature of each stage, and G_fusion denotes the semantic segmentation map.
CN202111658857.9A 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion Active CN114372986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111658857.9A CN114372986B (en) 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111658857.9A CN114372986B (en) 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion

Publications (2)

Publication Number Publication Date
CN114372986A CN114372986A (en) 2022-04-19
CN114372986B true CN114372986B (en) 2024-05-24

Family

ID=81141205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111658857.9A Active CN114372986B (en) 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion

Country Status (1)

Country Link
CN (1) CN114372986B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842312B (en) * 2022-05-09 2023-02-10 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN115496975B (en) * 2022-08-29 2023-08-18 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN116109645B (en) * 2023-04-14 2023-07-07 锋睿领创(珠海)科技有限公司 Intelligent processing method, device, equipment and medium based on priori knowledge
CN116935052B (en) * 2023-07-24 2024-03-01 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment
CN116978011B (en) * 2023-08-23 2024-03-15 广州新华学院 Image semantic communication method and system for intelligent target recognition
CN117014633B (en) * 2023-10-07 2024-04-05 深圳大学 Cross-modal data compression method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298361A (en) * 2019-05-22 2019-10-01 浙江省北大信息技术高等研究院 A kind of semantic segmentation method and system of RGB-D image
CN110929696A (en) * 2019-12-16 2020-03-27 中国矿业大学 Remote sensing image semantic segmentation method based on multi-mode attention and self-adaptive fusion
CN112634296A (en) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN113205520A (en) * 2021-04-22 2021-08-03 华中科技大学 Method and system for semantic segmentation of image

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298361A (en) * 2019-05-22 2019-10-01 浙江省北大信息技术高等研究院 A kind of semantic segmentation method and system of RGB-D image
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN110929696A (en) * 2019-12-16 2020-03-27 中国矿业大学 Remote sensing image semantic segmentation method based on multi-mode attention and self-adaptive fusion
CN112634296A (en) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN113205520A (en) * 2021-04-22 2021-08-03 华中科技大学 Method and system for semantic segmentation of image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Semantic segmentation model based on binocular images and cross-level feature guidance; 张娣; 陆建峰; Computer Engineering; 2020-10-15 (No. 10); full text *

Also Published As

Publication number Publication date
CN114372986A (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN114372986B (en) Image semantic segmentation method and device for attention-guided multi-modal feature fusion
Bashir et al. A comprehensive review of deep learning-based single image super-resolution
CN113284054B (en) Image enhancement method and image enhancement device
Guo et al. Dense scene information estimation network for dehazing
US11687773B2 (en) Learning method and recording medium
CN111028177A (en) Edge-based deep learning image motion blur removing method
Tang et al. Single image dehazing via lightweight multi-scale networks
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN111179189B (en) Image processing method and device based on generation of countermeasure network GAN, electronic equipment and storage medium
CN112581379A (en) Image enhancement method and device
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN111079764A (en) Low-illumination license plate image recognition method and device based on deep learning
CN113066025B (en) Image defogging method based on incremental learning and feature and attention transfer
CN114038006A (en) Matting network training method and matting method
CN114742799B (en) Industrial scene unknown type defect segmentation method based on self-supervision heterogeneous network
Xue et al. Boundary-induced and scene-aggregated network for monocular depth prediction
CN116152120A (en) Low-light image enhancement method and device integrating high-low frequency characteristic information
Yuan et al. Single image dehazing via NIN-DehazeNet
CN113076947A (en) RGB-T image significance detection system with cross-guide fusion
Zhang et al. Single image dehazing based on bright channel prior model and saliency analysis strategy
Nguyen et al. UnfairGAN: An enhanced generative adversarial network for raindrop removal from a single image
Li et al. Two‐stage single image dehazing network using swin‐transformer
Jiang et al. Pseudo‐Siamese residual atrous pyramid network for multi‐focus image fusion
CN117456330A (en) MSFAF-Net-based low-illumination target detection method
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant