CN114372986A - Attention-guided multi-modal feature fusion image semantic segmentation method and device
- Publication number: CN114372986A (application number CN202111658857.9A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06T7/10: Image analysis; Segmentation; Edge detection
- G06F18/253: Pattern recognition; Fusion techniques of extracted features
- G06T2207/10004: Image acquisition modality; Still image; Photographic image
Abstract
According to the attention-guided multi-modal feature fusion image semantic segmentation method and device disclosed by the embodiments of the invention, the extracted color image features and depth image features are first mixed; the mixed features are refined and added back into the input features along both the channel and spatial dimensions, eliminating the noise of the depth map and adaptively aligning the two sets of features. To further fuse the two sets of features complementarily, the complementary relationship between the color image and the depth image is adaptively learned by acquiring the importance degree of each corresponding position of the two sets of features, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method is adopted to introduce the fusion features of the encoding stage, adding more detail information so that more information focuses on the boundary regions during segmentation; fine segmentation of the boundary regions is realized, and a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an image semantic segmentation method and device for attention-guided multi-modal feature fusion.
Background
Semantic segmentation aims to accurately classify each pixel in an image; it is a pixel-level classification method widely applied in many fields such as vision-based autonomous driving, human-computer interaction, medical image segmentation, and three-dimensional map reconstruction. Accurate pixel classification effectively captures the scene information in an image: from the segmentation result, the specific position of each target in the image can be obtained, and the category and state of each target can be further determined. Enabling a computer to automatically understand a scene from acquired image information is one of the most challenging tasks in computer vision. In recent years, depth cameras, such as Intel's RealSense and Microsoft's Kinect, have been widely used to improve semantic segmentation performance. Compared with the color image alone, the depth information provides not only semantic information but also the size and geometric information of objects in the actual scene, further improving semantic segmentation performance.
For RGB-D semantic segmentation, many current methods improve performance mainly by fusing RGB image features and depth image features to generate features with stronger representation capability, and generally adopt an encoder-decoder structure; according to the fusion stage, these structures can be divided into early fusion, middle fusion and late fusion. Most of the fusion modules adopted by these methods directly fuse the depth map features with the color map features; the depth information is not fully utilized, and complementary fusion of the color map features and the depth map features is not achieved. Meanwhile, because the imaging of depth cameras such as the RealSense is affected by factors such as illumination, smooth surfaces and hardware interference, the depth image suffers from blurred boundaries and large hole areas; methods that directly fuse the features of the two modalities cannot eliminate the noise present in the depth information and introduce interference features into the network model, which ultimately reduces segmentation precision and results in poor robustness.
Disclosure of Invention
The embodiment of the invention mainly aims to provide an attention-guided multi-modal feature fusion image semantic segmentation method and device, which can at least solve the problems of poor robustness, low segmentation precision and the like of an RGB-D image semantic segmentation model provided in the related technology.
In order to achieve the above object, a first aspect of embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation method, which is applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layer feature fusion decoding network, and the method includes:
respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
connecting the color image features and the depth image features along a channel dimension through the multi-modal feature alignment network, performing convolution operation to obtain mixed features, aligning the color image features and the depth image features on the channel dimension and the space dimension based on the mixed features, and obtaining the aligned color image features and the aligned depth image features;
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic through the cross-modal characteristic fusion network, and then fusing the first weight matrix with the aligned color image characteristic and fusing the second weight matrix with the aligned depth image characteristic and then performing superposition processing to obtain a fusion characteristic;
and performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network, and outputting a semantic segmentation map.
In order to achieve the above object, a second aspect of the embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation apparatus, applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layered feature fusion decoding network, the apparatus including:
the extraction module is used for respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
the alignment module is used for performing convolution operation to obtain mixed features after the color image features and the depth image features are connected along channel dimensions through the multi-modal feature alignment network, and aligning the color image features and the depth image features on the channel dimensions and the space dimensions based on the mixed features to obtain the aligned color image features and the aligned depth image features;
the fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network, then fusing the first weight matrix and the aligned color image feature, and fusing the second weight matrix and the aligned depth image feature and then performing superposition processing to obtain a fusion feature;
and the decoding module is used for performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network and outputting a semantic segmentation map.
According to the attention-guided multi-modal feature fusion image semantic segmentation method and device provided by the embodiments of the invention, the extracted color image features and depth image features are mixed; the mixed features are refined and added back into the input features along both the channel and spatial dimensions, eliminating the noise present in the depth map and adaptively aligning the two sets of features. To further fuse the two sets of features complementarily, the complementary relationship between the color image and the depth image is adaptively learned by acquiring the importance degree of each corresponding position of the two sets of features, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method is adopted to introduce the fusion features of the encoding stage, adding more detail information so that more information focuses on the boundary regions during segmentation; fine segmentation of the boundary regions is realized, and a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Other features and corresponding effects of the present invention are set forth in later portions of the specification; it should be understood that at least some of these effects will become apparent from the description.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating a basic flow of a semantic segmentation method for an image according to a first embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-modal feature alignment network based on attention guidance according to a first embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a cross-modal feature fusion network according to a first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a multi-layer feature fusion decoding network according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating program modules of an image semantic segmentation apparatus according to a second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First embodiment:
In order to solve the technical problems of poor robustness and low segmentation accuracy of the RGB-D image semantic segmentation model provided in the related art, this embodiment provides an image semantic segmentation method applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network. FIG. 1 is a basic flow diagram of the image semantic segmentation method provided in this embodiment; the method includes the following steps:
Step 101: respectively performing feature extraction processing on a color image and a corresponding depth image through the feature extraction network to obtain color image features and depth image features.
Specifically, a Kinect camera, for example, is configured with a color camera for capturing color images and an infrared camera for simultaneously capturing depth images; the depth images can provide additional geometric and spatial information.
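The patent does not specify the backbone of the feature extraction network. The following is a minimal sketch, assuming two parallel ResNet-18 encoders (one per modality) that expose the per-layer features later consumed by the alignment network; the class and method names are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwinEncoder(nn.Module):
    """Two parallel backbones extracting per-layer features from the
    color image and the depth image (depth replicated to 3 channels)."""
    def __init__(self):
        super().__init__()
        self.rgb_backbone = models.resnet18(weights=None)
        self.depth_backbone = models.resnet18(weights=None)

    @staticmethod
    def _stages(b, x):
        x = b.maxpool(b.relu(b.bn1(b.conv1(x))))
        feats = []
        for layer in (b.layer1, b.layer2, b.layer3, b.layer4):
            x = layer(x)
            feats.append(x)  # one feature map per encoder stage
        return feats

    def forward(self, rgb, depth):
        # depth: (B, 1, H, W) -> replicate to 3 channels for the backbone
        return (self._stages(self.rgb_backbone, rgb),
                self._stages(self.depth_backbone, depth.repeat(1, 3, 1, 1)))
```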
Step 102: connecting the color image features and the depth image features along the channel dimension through the multi-modal feature alignment network, performing a convolution operation to obtain mixed features, and aligning the color image features and the depth image features in the channel dimension and the spatial dimension based on the mixed features to obtain the aligned color image features and the aligned depth image features.
Specifically, noise information inevitably exists in the depth map due to the acquisition process, and conventional fusion methods do not account for the noise present in the depth features; the multi-modal feature alignment network provided in this embodiment can effectively eliminate the noise features and realize alignment of the two sets of features.
FIG. 2 is a schematic structural diagram of the attention-guided multi-modal feature alignment network provided in this embodiment. First, the features obtained from the feature extraction network are connected together along the channel dimension, a convolution is adopted to reduce the number of channels of the connected features, and the two sets of features are adaptively fused. The specific implementation can be expressed as:
F_rgbd = F_fc(X ‖ D)

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features extracted at each layer of the feature extraction network; ‖ represents connecting the two features together along the channel dimension to obtain a feature of size R^(2C×H×W); F_fc represents a convolution operation that reduces the number of channels of the concatenated feature (R^(2C×H×W) → R^(C×H×W)); and F_rgbd ∈ R^(C×H×W) represents the mixed feature.
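As a minimal sketch of this mixing step, assuming F_fc is realized as a 1×1 convolution followed by batch normalization and ReLU (the patent only states that a convolution reduces the channel count), the module below connects X and D along the channel dimension and halves the channels:

```python
import torch
import torch.nn as nn

class FeatureMixer(nn.Module):
    """Connects X and D along the channel dimension and applies F_fc,
    reducing R^(2C x H x W) back to R^(C x H x W)."""
    def __init__(self, channels):
        super().__init__()
        self.f_fc = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, d):
        return self.f_fc(torch.cat([x, d], dim=1))  # F_rgbd
```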
Secondly, performing global average pooling on the mixed features in the horizontal direction and the vertical direction respectively to obtain a feature vector in the horizontal direction and a feature vector in the vertical direction respectively; connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and coded intermediate feature map; and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Specifically, in this embodiment, global average pooling is performed on the mixed features in the horizontal and vertical directions, the mixed features are converted into feature vectors in the two directions, compression and excitation are performed on the channel dimension, feature alignment on the channel dimension is realized, and spatial position information is retained in the feature alignment process. The specific implementation algorithm of this process can be expressed as the following formula:
f = δ(F_h ‖ F_w)

wherein the mixed feature F_rgbd ∈ R^(C×H×W) is subjected to global average pooling in the horizontal and vertical directions to obtain the features F_h ∈ R^(C×H×1) and F_w ∈ R^(C×1×W); ‖ represents connecting the two features together along the channel dimension; δ represents the compressing nonlinear activation function; f represents the intermediate feature map obtained by compressing and encoding the spatial information in the horizontal and vertical directions; and r is a coefficient controlling the channel reduction ratio.
In an optional implementation of this embodiment, a preset alignment repair algorithm is further adopted to perform feature alignment on the compressed and encoded intermediate feature map, obtaining the aligned color map features and the aligned depth map features. That is, the weight matrices are multiplied element by element with the original features to obtain a feature representation with less noise; the weighting thus enhances the feature response of the output at the corresponding positions.
The alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map f is divided along the channel dimension; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features. Through the convolution operation F_h^X, f_h is restored to a feature with channel number C, adaptively acquiring the weight matrix of the color image features along the horizontal direction; through F_w^X, f_w is restored to a feature with channel number C, adaptively acquiring the weight matrix of the color image features along the vertical direction; through F_h^D, f_h is restored to a feature with channel number C, adaptively acquiring the weight matrix of the depth map features along the horizontal direction; and through F_w^D, f_w is restored to a feature with channel number C, adaptively acquiring the weight matrix of the depth map features along the vertical direction.
In this way, the network can exploit the most useful visual appearance and geometry information, effectively suppressing noise features in the depth stream.
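The sketch below assembles the whole alignment step under stated assumptions: the directional pooling, compression and split follow the closely related coordinate-attention layout (the pooled vectors are reshaped so they can be concatenated and split, whereas the patent describes the connection and division along the channel dimension), and the weight matrices are applied by element-wise multiplication as in the reconstruction above. The reduction ratio r and the 1×1 kernels are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class MultiModalAlign(nn.Module):
    """Directional pooling of F_rgbd, compression by ratio r, split into
    f_h / f_w, and per-modality sigmoid weight matrices applied to X, D."""
    def __init__(self, c, r=16):
        super().__init__()
        mid = max(c // r, 8)
        self.compress = nn.Sequential(
            nn.Conv2d(c, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # four convolutions restoring channel number C (F_h^X, F_w^X, F_h^D, F_w^D)
        self.xh, self.xw = nn.Conv2d(mid, c, 1), nn.Conv2d(mid, c, 1)
        self.dh, self.dw = nn.Conv2d(mid, c, 1), nn.Conv2d(mid, c, 1)

    def forward(self, x, d, f_rgbd):
        _, _, h, w = f_rgbd.shape
        f_h = f_rgbd.mean(dim=3, keepdim=True)                      # (B, C, H, 1)
        f_w = f_rgbd.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1)
        f = self.compress(torch.cat([f_h, f_w], dim=2))             # (B, C/r, H+W, 1)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)                               # (B, C/r, 1, W)
        x_rep = x * torch.sigmoid(self.xh(f_h)) * torch.sigmoid(self.xw(f_w))
        d_rep = d * torch.sigmoid(self.dh(f_h)) * torch.sigmoid(self.dw(f_w))
        return x_rep, d_rep
```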
Step 103: respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features through the cross-modal feature fusion network, then fusing the first weight matrix with the aligned color image features and the second weight matrix with the aligned depth image features, and performing superposition processing to obtain fusion features.
FIG. 3 is a schematic structural diagram of the cross-modal feature fusion network provided in this embodiment. The aligned depth features and color features are each compressed into one channel, and convolution is used to learn the difference distributions of the color features and the depth features at different position points. To further obtain color features and depth features with complementarity, the feature map is divided into two feature maps along the channel direction, the importance degree of each position point of the two feature maps is obtained complementarily using the softmax function, and the weights are respectively fused with the corresponding aligned features and then superposed, so that the color map features and the depth map features adaptively generate a high-quality feature map.
Specifically, in this embodiment, a preset difference distribution learning algorithm may be adopted through a cross-modal feature fusion network to respectively obtain a first weight matrix of importance degrees of each position point of aligned color image features and a second weight matrix of importance degrees of each position point of aligned depth image features;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations that compress the input features X_rep and D_rep, each with channel number C, to a single channel (R^(C×H×W) → R^(1×H×W)); ‖ represents connection along the channel dimension; δ represents the softmax activation function, through which the relative importance degree of each position point of the depth map features and the color map features is obtained complementarily, realizing complementary fusion of the two features; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
Further, the fusion feature is obtained based on a preset fusion algorithm, and the fusion algorithm can be expressed as:
X_fusion = G_X′ × X_rep + G_D′ × D_rep

wherein X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) represent the aligned features, and X_fusion ∈ R^(C×H×W) represents the fused output feature.
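A sketch of the cross-modal feature fusion network following the two formulas above; the 1×1 compression convolutions are an assumption, since the patent does not state the kernel size:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Compresses each aligned feature to one channel, connects the two
    maps and applies softmax across them, yielding the complementary
    per-position weights G_X' and G_D'; returns X_fusion."""
    def __init__(self, c):
        super().__init__()
        self.px = nn.Conv2d(c, 1, kernel_size=1)  # R^(C,H,W) -> R^(1,H,W)
        self.pd = nn.Conv2d(c, 1, kernel_size=1)

    def forward(self, x_rep, d_rep):
        g = torch.cat([self.px(x_rep), self.pd(d_rep)], dim=1)  # (B, 2, H, W)
        g = torch.softmax(g, dim=1)           # complementary weights per position
        gx, gd = g[:, 0:1], g[:, 1:2]         # G_X', G_D' in R^(1,H,W)
        return gx * x_rep + gd * d_rep        # X_fusion
```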
Step 104: performing convolution operations and upsampling processing on the fusion features layer by layer through the multi-layer feature fusion decoding network, and outputting a semantic segmentation map.
Specifically, in an encoder-decoder architecture, the multiple downsampling operations of the encoder stage may cause loss of detail information. If the decoding stage directly upsamples only the fusion features extracted by the backbone network at the final stage, the resulting segmentation is inaccurate in boundary regions and may even be wrong there. Based on this, FIG. 4 is a schematic structural diagram of the multi-layer feature fusion decoding network provided in this embodiment: a preset decoding algorithm is adopted, the fusion features are subjected to convolution operations and upsampling processing layer by layer, and a semantic segmentation map is output. The decoding algorithm is expressed as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents the fusion features obtained in the encoding stage; F_NBt1D represents three NBt1D layers, each NBt1D module comprising four convolutional layers and having fewer parameters and a smaller amount of computation than an ordinary convolution; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage of the decoding network; and G_fusion represents the semantic segmentation map output by each stage, which is restored to the size of the original image through upsampling.
In this way, the present embodiment adds the fusion features of the encoding stage in the decoding stage and merges them into the decoder layer by layer using the NBt1D module; since the NBt1D module is lightweight, a high-resolution, accurate semantic segmentation map can be constructed without introducing a large number of parameters.
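Assuming that NBt1D refers to the non-bottleneck-1D design known from ERFNet (four factorized 3×1/1×3 convolution layers with a residual connection, which matches the four-convolution-layer description above), one decoding stage might look like the following sketch; the bilinear upsampling and the shared channel width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NBt1D(nn.Module):
    """Four factorized convolution layers (3x1 and 1x3 pairs): the same
    receptive field as two 3x3 convolutions, with fewer parameters."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):
        return torch.relu(x + self.block(x))  # residual connection

class DecoderStage(nn.Module):
    """One stage of G_fusion = F_{3,2,1} + F_NBt1D(relu(F_c(G))),
    followed by 2x upsampling toward the original resolution."""
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Conv2d(c, c, 3, padding=1)                 # F_c
        self.nbt = nn.Sequential(NBt1D(c), NBt1D(c), NBt1D(c))  # three layers

    def forward(self, g, skip_fusion):
        g = skip_fusion + self.nbt(torch.relu(self.fc(g)))
        return nn.functional.interpolate(
            g, scale_factor=2, mode='bilinear', align_corners=False)
```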
According to the attention-guided multi-modal feature fusion image semantic segmentation method provided by this embodiment of the invention, features are first extracted from the RGB image and the depth image using the feature extraction network, and attention is used to control the feature refinement. In addition, to better fuse the two sets of features, the invention utilizes their complementarity to realize cross-complementary aggregation of features in the spatial dimension, finally achieving cross-modal feature fusion. Because the multiple downsampling operations of the encoding stage may cause loss of detail information, the upsampling of the decoding stage introduces the fusion features with stronger representation capability obtained in the encoding stage. Experiments on the RGB-D semantic segmentation datasets NYUv2 and SUN RGB-D show that the method outperforms most existing methods on the RGB-D semantic segmentation task, with good segmentation effect and a small number of parameters.
Second embodiment:
In order to solve the technical problems of poor robustness and low segmentation accuracy of the RGB-D image semantic segmentation model provided in the related art, this embodiment provides an image semantic segmentation apparatus applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network. Referring to FIG. 5, the image semantic segmentation apparatus of this embodiment includes:
an extraction module 501, configured to perform feature extraction processing on the color image and the corresponding depth image through a feature extraction network, respectively, to obtain a color image feature and a depth image feature;
an alignment module 502, configured to connect the color image features and the depth image features along the channel dimension through a multi-modal feature alignment network, perform convolution operation to obtain mixed features, and align the color image features and the depth image features in the channel dimension and the spatial dimension based on the mixed features to obtain aligned color image features and aligned depth image features;
the fusion module 503 is configured to obtain a first weight matrix of importance degrees of each position point of the aligned color image feature and a second weight matrix of importance degrees of each position point of the aligned depth image feature through a cross-modal feature fusion network, and then fuse the first weight matrix with the aligned color image feature and fuse the second weight matrix with the aligned depth image feature and then perform superposition processing to obtain a fusion feature;
and the decoding module 504 is configured to perform convolution operation and upsampling processing on the fusion features layer by layer through a multi-layer feature fusion decoding network, and output a semantic segmentation map.
In an optional implementation manner of this embodiment, the alignment module is specifically configured to: performing global average pooling on the mixed features in the horizontal direction and the vertical direction respectively to obtain a feature vector in the horizontal direction and a feature vector in the vertical direction respectively; connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and coded intermediate feature map; and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Further, in an optional implementation manner of this embodiment, the alignment module is specifically configured to: performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth map features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension; F_h^X, F_w^X, F_h^D and F_w^D represent convolution operations restoring f_h and f_w to features with channel number C; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features.
In an optional implementation manner of this embodiment, the fusion module is specifically configured to: respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic by adopting a preset difference distribution learning algorithm through a cross-modal characteristic fusion network;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations compressing each input to a single channel; ‖ represents connection along the channel dimension; δ represents the softmax activation function; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
In an optional implementation manner of this embodiment, the decoding module is specifically configured to: perform convolution operation and up-sampling processing on the fusion features layer by layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and output a semantic segmentation map;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents the fusion features of the encoding stage; F_NBt1D represents three NBt1D layers, each NBt1D module containing four convolutional layers; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage; and G_fusion represents the semantic segmentation map.
It should be noted that, the image semantic segmentation method in the foregoing embodiment can be implemented based on the image semantic segmentation device provided in this embodiment, and it can be clearly understood by those skilled in the art that, for convenience and simplicity of description, a specific working process of the image semantic segmentation device described in this embodiment may refer to a corresponding process in the foregoing method embodiment, and details are not described here.
The attention-guided multi-modal feature fusion image semantic segmentation device provided by this embodiment mixes the extracted color image features and depth image features; the mixed features are refined and added back into the input features along both the channel and spatial dimensions, eliminating the noise present in the depth map and adaptively aligning the two sets of features. To further fuse the two sets of features complementarily, the complementary relationship between the color image and the depth image is adaptively learned by acquiring the importance degree of each corresponding position of the two sets of features, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method is adopted to introduce the fusion features of the encoding stage, adding more detail information so that more information focuses on the boundary regions during segmentation; fine segmentation of the boundary regions is realized, and a more accurate and efficient semantic segmentation map is generated. The robustness and segmentation precision of the RGB-D image semantic segmentation model are thereby effectively improved.
Third embodiment:
The present embodiment provides an electronic device. As shown in FIG. 6, the electronic device includes a processor 601, a memory 602 and a communication bus 603, wherein: the communication bus 603 is used for realizing connection and communication between the processor 601 and the memory 602; and the processor 601 is configured to execute one or more computer programs stored in the memory 602 to implement at least one step of the image semantic segmentation method in the first embodiment.
The present embodiments also provide a computer-readable storage medium including volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other Memory technology, CD-ROM (Compact disk Read-Only Memory), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
The computer-readable storage medium in this embodiment may be used for storing one or more computer programs, and the stored one or more computer programs may be executed by a processor to implement at least one step of the method in the first embodiment.
The present embodiment also provides a computer program, which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the method in the first embodiment; and in some cases at least one of the steps shown or described may be performed in an order different than that described in the embodiments above.
The present embodiments also provide a computer program product comprising a computer readable means on which a computer program as shown above is stored. The computer readable means in this embodiment may include a computer readable storage medium as shown above.
It will be apparent to those skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software (which may be implemented in computer program code executable by a computing device), firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
In addition, communication media typically embodies computer readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to one of ordinary skill in the art. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a more detailed description of embodiments of the present invention, and the present invention is not to be considered limited to such descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (10)
1. An attention-guided multi-modal feature fusion image semantic segmentation method is applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, and is characterized by comprising the following steps of:
respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
connecting the color image features and the depth image features along a channel dimension through the multi-modal feature alignment network, performing convolution operation to obtain mixed features, aligning the color image features and the depth image features on the channel dimension and the space dimension based on the mixed features, and obtaining the aligned color image features and the aligned depth image features;
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic through the cross-modal characteristic fusion network, and then fusing the first weight matrix with the aligned color image characteristic and fusing the second weight matrix with the aligned depth image characteristic and then performing superposition processing to obtain a fusion characteristic;
and performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network, and outputting a semantic segmentation map.
2. The image semantic segmentation method according to claim 1, wherein the step of aligning the color map features and the depth map features in a channel dimension and a space dimension based on the mixture features to obtain aligned color map features and aligned depth map features comprises:
respectively carrying out global average pooling on the mixed features in the horizontal direction and the vertical direction to respectively obtain a feature vector in the horizontal direction and a feature vector in the vertical direction;
connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along a channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map;
and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
3. The image semantic segmentation method according to claim 2, wherein the step of performing feature alignment based on the compressed and encoded intermediate feature map to obtain an aligned color map feature and an aligned depth map feature comprises:
performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension; F_h^X, F_w^X, F_h^D and F_w^D represent convolution operations restoring f_h and f_w to features with channel number C; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features.
4. The image semantic segmentation method according to claim 1, wherein the step of obtaining a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network respectively comprises:
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features by adopting a preset difference distribution learning algorithm through the cross-modal feature fusion network;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations compressing each input to a single channel; ‖ represents connection along the channel dimension; δ represents the softmax activation function; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
5. The image semantic segmentation method according to any one of claims 1 to 4, wherein the step of performing convolution operation and upsampling processing on the fused features layer by layer through the multi-layer feature fusion decoding network to output a semantic segmentation map comprises:
performing convolution operation and up-sampling processing on the fusion features layer by layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and outputting a semantic segmentation map;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents said fusion features of the encoding stage; F_NBt1D represents three NBt1D layers, each of the NBt1D modules containing four convolutional layers; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage; and G_fusion represents the semantic segmentation map.
6. An attention-guided multi-modal feature fusion image semantic segmentation device is applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, and is characterized by comprising:
the extraction module is used for respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
the alignment module is used for performing convolution operation to obtain mixed features after the color image features and the depth image features are connected along channel dimensions through the multi-modal feature alignment network, and aligning the color image features and the depth image features on the channel dimensions and the space dimensions based on the mixed features to obtain the aligned color image features and the aligned depth image features;
the fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network, then fusing the first weight matrix and the aligned color image feature, and fusing the second weight matrix and the aligned depth image feature and then performing superposition processing to obtain a fusion feature;
and the decoding module is used for performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network and outputting a semantic segmentation map.
7. The image semantic segmentation apparatus according to claim 6, wherein the alignment module is specifically configured to:
respectively carrying out global average pooling on the mixed features in the horizontal direction and the vertical direction to respectively obtain a feature vector in the horizontal direction and a feature vector in the vertical direction;
connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along a channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map;
and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
8. The image semantic segmentation apparatus of claim 7, wherein the alignment module is further configured to:
performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X ⊙ σ(F_h^X(f_h)) ⊙ σ(F_w^X(f_w))
D_rep = D ⊙ σ(F_h^D(f_h)) ⊙ σ(F_w^D(f_w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color image features and the depth image features obtained by the feature extraction network; σ represents the sigmoid activation function; f_h and f_w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension; F_h^X, F_w^X, F_h^D and F_w^D represent convolution operations restoring f_h and f_w to features with channel number C; ⊙ represents element-wise multiplication; and X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color image features and aligned depth image features.
9. The image semantic segmentation apparatus according to claim 6, wherein the fusion module is specifically configured to:
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features by adopting a preset difference distribution learning algorithm through the cross-modal feature fusion network;
the difference distribution learning algorithm is expressed as:
G_X′ ‖ G_D′ = δ(F_c^X(X_rep) ‖ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color map features and the aligned depth map features; F_c^X and F_c^D represent convolution operations compressing each input to a single channel; ‖ represents connection along the channel dimension; δ represents the softmax activation function; and G_X′ ∈ R^(1×H×W) and G_D′ ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
10. The image semantic segmentation apparatus according to any one of claims 6 to 9, wherein the decoding module is specifically configured to:
performing convolution operation and up-sampling processing on the fusion features layer by layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and outputting a semantic segmentation map;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents said fusion features of the encoding stage; F_NBt1D represents three NBt1D layers, each of the NBt1D modules containing four convolutional layers; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage; and G_fusion represents the semantic segmentation map.
Priority Applications (1)
- CN202111658857.9A, filed 2021-12-30: Image semantic segmentation method and device for attention-guided multi-modal feature fusion (granted as CN114372986B)
Publications (2)
- CN114372986A, published 2022-04-19
- CN114372986B, granted 2024-05-24
Family ID: 81141205 (one family application: CN202111658857.9A, status Active)
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant