CN114372986A - Attention-guided multi-modal feature fusion image semantic segmentation method and device - Google Patents

Info

Publication number
CN114372986A
Authority
CN
China
Prior art keywords: features, feature, fusion, aligned, representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111658857.9A
Other languages
Chinese (zh)
Other versions
CN114372986B (en)
Inventor
钦闯
邹文斌
田时舜
李霞
邹辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huishi Innovation Shenzhen Co ltd
Shenzhen University
Original Assignee
Huishi Innovation Shenzhen Co ltd
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huishi Innovation Shenzhen Co ltd, Shenzhen University filed Critical Huishi Innovation Shenzhen Co ltd
Priority to CN202111658857.9A priority Critical patent/CN114372986B/en
Publication of CN114372986A publication Critical patent/CN114372986A/en
Application granted granted Critical
Publication of CN114372986B publication Critical patent/CN114372986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

According to the attention-guided multi-modal feature fusion image semantic segmentation method and device disclosed by the embodiments of the invention, the extracted color-map features and depth-map features are first mixed; the mixed features are refined and added back into the input features in both the channel and spatial dimensions, which eliminates the noise of the depth map and adaptively aligns the two sets of features. To further fuse the two sets of features in a complementary manner, the importance of each corresponding position in the two sets of features is obtained so that the complementary relation between the color image and the depth image is learned adaptively, realizing complementary fusion of the multi-modal features. To introduce important spatial detail information in the decoding stage, a multi-layer feature fusion method introduces the fusion features from the encoding stage, adding more detail information so that more information focuses on boundary regions during segmentation; this achieves fine segmentation of boundary regions and produces a more accurate and efficient semantic segmentation map. The robustness and segmentation accuracy of the RGB-D image semantic segmentation model are thereby effectively improved.

Description

Attention-guided multi-modal feature fusion image semantic segmentation method and device
Technical Field
The invention relates to the technical field of image processing, in particular to an image semantic segmentation method and device for attention-guided multi-modal feature fusion.
Background
Semantic segmentation aims to accurately classify every pixel in an image; it is a pixel-level classification method widely applied in many fields such as vision-based automatic driving, human-computer interaction, medical image segmentation and three-dimensional map reconstruction. Accurate pixel classification effectively captures the scene information in an image: the segmentation result gives the specific position of each target in the image, from which the category and state of each target can be further obtained. Enabling a computer to automatically understand a scene from acquired image information is one of the most challenging tasks in computer vision. In recent years, depth cameras such as Intel's RealSense and Microsoft's Kinect have been widely used to improve semantic segmentation performance. Compared with the color image alone, the depth information provides not only semantic cues but also the size and geometric information of objects in the actual scene, further improving semantic segmentation performance.
For RGB-D semantic segmentation, many current methods improve performance mainly by fusing RGB image features and depth image features to generate features with stronger representation capability, generally adopting an encoder-decoder structure; depending on the stage at which fusion occurs, these methods can be divided into early fusion, middle fusion and late fusion. Most of the fusion modules adopted by these methods directly fuse the depth-map features and the color-map features, so the depth information is not fully exploited and complementary fusion of the color-map features and depth-map features is not achieved. Meanwhile, because the imaging of depth cameras such as the RealSense is affected by factors such as illumination, smooth surfaces and hardware interference, the depth image suffers from blurred boundaries and large hole areas; methods that directly fuse the features of the two modalities cannot eliminate the noise present in the depth information and introduce interfering features into the network model, which ultimately reduces segmentation accuracy and results in poor robustness.
Disclosure of Invention
The embodiment of the invention mainly aims to provide an attention-guided multi-modal feature fusion image semantic segmentation method and device, which can at least solve the problems of poor robustness, low segmentation precision and the like of an RGB-D image semantic segmentation model provided in the related technology.
In order to achieve the above object, a first aspect of embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation method, which is applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layer feature fusion decoding network, and the method includes:
respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
connecting the color image features and the depth image features along a channel dimension through the multi-mode feature alignment network, performing convolution operation to obtain mixed features, aligning the color image features and the depth image features on the channel dimension and the space dimension based on the mixed features, and obtaining the aligned color image features and the aligned depth image features;
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic through the cross-modal characteristic fusion network, and then fusing the first weight matrix with the aligned color image characteristic and fusing the second weight matrix with the aligned depth image characteristic and then performing superposition processing to obtain a fusion characteristic;
and performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network, and outputting a semantic segmentation graph.
In order to achieve the above object, a second aspect of the embodiments of the present invention provides an attention-guided multi-modal feature fusion image semantic segmentation apparatus, applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network, and a multi-layered feature fusion decoding network, the apparatus including:
the extraction module is used for respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
the alignment module is used for performing convolution operation to obtain mixed features after the color image features and the depth image features are connected along channel dimensions through the multi-mode feature alignment network, and aligning the color image features and the depth image features on the channel dimensions and the space dimensions based on the mixed features to obtain the aligned color image features and the aligned depth image features;
the fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network, then fusing the first weight matrix and the aligned color image feature, and fusing the second weight matrix and the aligned depth image feature and then performing superposition processing to obtain a fusion feature;
and the decoding module is used for performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network and outputting a semantic segmentation graph.
According to the image semantic segmentation method and device for attention-guided multi-modal feature fusion provided by the embodiment of the invention, the extracted color image features and depth image features are mixed; refining and adding the mixed features into the input features in two dimensions of a channel and a space, eliminating noise existing in a depth map, and aligning the two parts of features in a self-adaptive manner; in order to further complementarily fuse the two part features, the complementary fusion of the multi-modal features is realized by acquiring the importance degrees of the corresponding positions of the two parts of features and learning the complementary relation between the color image and the depth image in a self-adaptive manner; in order to introduce important spatial detail information in a decoding stage, a multi-layer feature fusion method is adopted to introduce fusion features in a coding stage, more detail information is added, more information is focused on a boundary region during segmentation, fine segmentation of the boundary region is realized, and a more accurate and efficient semantic segmentation map is generated. Therefore, the robustness and the segmentation precision of the RGB-D image semantic segmentation model are effectively improved.
Other features and corresponding effects of the present invention are set forth in the following portions of the specification, and it should be understood that at least some of the effects are apparent from the description of the present invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating a basic flow of a semantic segmentation method for an image according to a first embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-modal feature alignment network based on attention guidance according to a first embodiment of the present invention;
fig. 3 is a schematic structural diagram of a cross-modal feature fusion network according to a first embodiment of the present invention;
fig. 4 is a schematic structural diagram of a multi-layer feature fusion decoding network according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating program modules of an image semantic segmentation apparatus according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment:
in order to solve the technical problems of poor robustness, low segmentation accuracy and the like of the RGB-D image semantic segmentation models provided in the related art, the present embodiment provides an image semantic segmentation method, which is applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network. Fig. 1 is a basic flow diagram of the image semantic segmentation method provided in this embodiment; the method includes the following steps:
step 101, respectively performing feature extraction processing on the color image and the corresponding depth image through a feature extraction network to obtain color image features and depth image features.
Specifically, for example, a Kinect camera is configured with a color camera for capturing color images and an infrared camera for capturing depth images simultaneously, and the depth images can provide more geometric information and spatial information.
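By way of illustration of step 101, the following is a minimal sketch, written in PyTorch, of a dual-branch feature extraction network. The patent does not fix a specific backbone or framework, so the use of torchvision's resnet34, the single-channel depth stem and all module names here are illustrative assumptions rather than the implementation described by the invention.

```python
# Minimal sketch of a dual-branch RGB-D feature extractor.
# Assumption: a ResNet-style backbone (torchvision resnet34) is used purely for illustration.
import torch
import torch.nn as nn
from torchvision.models import resnet34

class DualBranchEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_backbone = resnet34(weights=None)
        self.depth_backbone = resnet34(weights=None)
        # The depth map is a single-channel image, so the first convolution of the
        # depth branch is replaced to accept one input channel (illustrative choice).
        self.depth_backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                              padding=3, bias=False)

    def forward_stages(self, backbone, x):
        x = backbone.relu(backbone.bn1(backbone.conv1(x)))
        x = backbone.maxpool(x)
        feats = []
        for layer in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
            x = layer(x)
            feats.append(x)  # one feature map per encoder stage
        return feats

    def forward(self, rgb, depth):
        # Returns per-stage color-map features X and depth-map features D.
        return (self.forward_stages(self.rgb_backbone, rgb),
                self.forward_stages(self.depth_backbone, depth))
```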
And 102, connecting the color image features and the depth image features along the channel dimension through a multi-mode feature alignment network, performing convolution operation to obtain mixed features, aligning the color image features and the depth image features on the channel dimension and the space dimension based on the mixed features, and obtaining the aligned color image features and the aligned depth image features.
Specifically, noise information inevitably exists in the depth map in the acquisition process, noise existing in the depth features cannot be considered in the conventional fusion method, and the multi-mode feature alignment network provided by the embodiment can effectively eliminate the noise features and realize alignment of the two features.
Fig. 2 is a schematic structural diagram of the attention-guided multi-modal feature alignment network provided in this embodiment. First, the features obtained from the feature extraction network are connected together along the channel dimension, a convolution is used to reduce the number of channels of the connected features, and the two sets of features are adaptively fused. The specific implementation algorithm can be expressed as:
F_rgbd = F_fc(X ∥ D)

where X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color-map features and depth-map features extracted at each layer of the feature extraction network; ∥ denotes connecting the two features along the channel dimension to obtain a feature of size R^(2C×H×W); F_fc represents a convolution operation that reduces the number of channels of the concatenated feature (R^(2C×H×W) → R^(C×H×W)); and F_rgbd ∈ R^(C×H×W) denotes the mixed feature.
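A minimal sketch of this mixing step follows, assuming F_fc is realized as a 1×1 convolution with batch normalization and ReLU; the patent does not specify the kernel size or normalization, so these details are assumptions.

```python
# Minimal sketch of the feature-mixing step F_rgbd = F_fc(X || D): the two features are
# concatenated along the channel dimension and a convolution (assumed 1x1 here) reduces
# the channel count from 2C back to C.
import torch
import torch.nn as nn

class FeatureMix(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, d):
        # x, d: (N, C, H, W) color-map and depth-map features from the same stage
        return self.reduce(torch.cat([x, d], dim=1))  # F_rgbd: (N, C, H, W)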
Secondly, performing global average pooling on the mixed features in the horizontal direction and the vertical direction respectively to obtain a feature vector in the horizontal direction and a feature vector in the vertical direction respectively; connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and coded intermediate feature map; and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Specifically, in this embodiment, global average pooling is performed on the mixed features in the horizontal and vertical directions, the mixed features are converted into feature vectors in the two directions, compression and excitation are performed on the channel dimension, feature alignment on the channel dimension is realized, and spatial position information is retained in the feature alignment process. The specific implementation algorithm of this process can be expressed as the following formula:
F_h(h) = (1/W) Σ_{0≤i<W} F_rgbd(h, i)

F_w(w) = (1/H) Σ_{0≤j<H} F_rgbd(j, w)

f = δ(F_h ∥ F_w)

where the mixed feature F_rgbd ∈ R^(C×H×W) is globally average-pooled in the horizontal and vertical directions to obtain the features F_h(h) ∈ R^(C×H×1) and F_w(w) ∈ R^(C×1×W); ∥ denotes connecting the two features along the channel dimension; δ denotes the compressing non-linear activation function; f ∈ R^(C/r×1×(H+W)) is the intermediate feature map obtained by compression-encoding the spatial information in the horizontal and vertical directions; and r is a coefficient controlling the reduction ratio.
In an optional implementation of this embodiment, a preset alignment repair algorithm is further used to perform feature alignment on the compressed and encoded intermediate feature map, so as to obtain the aligned color-map features and aligned depth-map features. That is, the original features are multiplied element by element by the obtained weights to produce a feature representation with less noise, and finally the feature response of the output is enhanced at the corresponding positions in a weighted manner.
The alignment repair algorithm is represented as:
X_rep = X + X · σ(F_c^Xh(f^h)) · σ(F_c^Xw(f^w))

D_rep = D + D · σ(F_c^Dh(f^h)) · σ(F_c^Dw(f^w))

where X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color-map features and depth-map features obtained by the feature extraction network; σ represents the sigmoid activation function; f^h and f^w represent the two independent feature maps into which the compressed and encoded intermediate feature map f is divided along the channel dimension; X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color-map features and aligned depth-map features; and F_c^Xh, F_c^Xw, F_c^Dh and F_c^Dw represent convolution operations. The convolution F_c^Xh restores f^h to a feature with channel number C and adaptively obtains the weight matrix of the color-map features along the horizontal direction; F_c^Xw restores f^w to a feature with channel number C and adaptively obtains the weight matrix of the color-map features along the vertical direction; F_c^Dh restores f^h to a feature with channel number C and adaptively obtains the weight matrix of the depth-map features along the horizontal direction; and F_c^Dw restores f^w to a feature with channel number C and adaptively obtains the weight matrix of the depth-map features along the vertical direction.
Thus, the network can take advantage of the most useful visual appearance and geometry and thus effectively suppress noise features in the depth stream.
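A minimal sketch of this alignment-repair step follows, based on the reconstruction above: four convolutions restore the channel count, sigmoid turns the results into weight matrices, and the input features are re-weighted and residually added. The module name, the layer names, the 1×1 kernels and the reduction ratio r are illustrative assumptions.

```python
# Minimal sketch of the alignment-repair step of the multi-modal feature alignment network.
import torch
import torch.nn as nn

class AlignmentRepair(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        mid = channels // r
        self.f_xh = nn.Conv2d(mid, channels, kernel_size=1)  # F_c^Xh
        self.f_xw = nn.Conv2d(mid, channels, kernel_size=1)  # F_c^Xw
        self.f_dh = nn.Conv2d(mid, channels, kernel_size=1)  # F_c^Dh
        self.f_dw = nn.Conv2d(mid, channels, kernel_size=1)  # F_c^Dw

    def forward(self, x, d, f_h, f_w):
        # x, d: (N, C, H, W); f_h: (N, C/r, H, 1) and f_w: (N, C/r, W, 1) from the compression step
        f_w = f_w.permute(0, 1, 3, 2)                        # -> (N, C/r, 1, W)
        w_xh = torch.sigmoid(self.f_xh(f_h))                 # weights for X from the horizontal pooling
        w_xw = torch.sigmoid(self.f_xw(f_w))                 # weights for X from the vertical pooling
        w_dh = torch.sigmoid(self.f_dh(f_h))                 # weights for D from the horizontal pooling
        w_dw = torch.sigmoid(self.f_dw(f_w))                 # weights for D from the vertical pooling
        x_rep = x + x * w_xh * w_xw                          # aligned color-map features X_rep
        d_rep = d + d * w_dh * w_dw                          # aligned depth-map features D_rep
        return x_rep, d_rep
```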
103, respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features through a cross-modal feature fusion network, then fusing the first weight matrix and the aligned color image features, and fusing the second weight matrix and the aligned depth image features and then performing superposition processing to obtain fusion features.
Fig. 3 is a schematic structural diagram of the cross-modal feature fusion network provided in this embodiment. The aligned depth features and color features are each compressed into one channel, and convolution is used to learn the difference distribution of the color features and depth features at different position points. In order to further obtain color features and depth features with complementarity, the resulting feature map is divided into two feature maps along the channel direction, the importance of each position point of the two feature maps is obtained complementarily using the softmax function, and the weights are respectively fused with the corresponding aligned features and then superposed, so that the color-map features and depth-map features adaptively generate a high-quality feature map.
Specifically, in this embodiment, a preset difference distribution learning algorithm may be adopted through a cross-modal feature fusion network to respectively obtain a first weight matrix of importance degrees of each position point of aligned color image features and a second weight matrix of importance degrees of each position point of aligned depth image features;
the difference distribution learning algorithm is expressed as:
G_X' ∥ G_D' = δ(F_c^X(X_rep) ∥ F_c^D(D_rep))

where X_rep and D_rep respectively represent the aligned color-map features and aligned depth-map features; F_c^X and F_c^D represent convolution operations, the convolution F_c^X compressing the input feature X_rep with channel number C to one channel (R^(C×H×W) → R^(1×H×W)) and the convolution F_c^D compressing the input feature D_rep with channel number C to one channel (R^(C×H×W) → R^(1×H×W)); ∥ represents connection along the channel dimension; δ represents the softmax activation function, through which the relative importance of each position point of the depth-map features and color-map features is obtained in a complementary manner, realizing complementary fusion of the two features; and G_X' ∈ R^(1×H×W), G_D' ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
Further, the fusion feature is obtained based on a preset fusion algorithm, and the fusion algorithm can be expressed as:
X_fusion = G_X' × X_rep + G_D' × D_rep

where X_rep ∈ R^(C×H×W) and D_rep ∈ R^(C×H×W) represent the aligned features, and X_fusion ∈ R^(C×H×W) represents the fused output feature.
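A minimal sketch of this cross-modal fusion follows; the 3×3 kernel size of the channel-compression convolutions is an assumption, since the text only states that each aligned feature is compressed to a single channel before the softmax is applied.

```python
# Minimal sketch of the cross-modal feature fusion network: each aligned feature is
# compressed to one channel, the two maps are joined and passed through a softmax over
# the channel axis so the weights at every position sum to one, and the aligned features
# are summed with these complementary weights.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.squeeze_x = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # F_c^X
        self.squeeze_d = nn.Conv2d(channels, 1, kernel_size=3, padding=1)  # F_c^D

    def forward(self, x_rep, d_rep):
        g = torch.cat([self.squeeze_x(x_rep), self.squeeze_d(d_rep)], dim=1)  # (N, 2, H, W)
        g = torch.softmax(g, dim=1)                     # complementary weights per position
        g_x, g_d = g[:, 0:1], g[:, 1:2]                 # first and second weight matrices
        return g_x * x_rep + g_d * d_rep                # X_fusion: (N, C, H, W)
```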
And step 104, performing convolution operation and upsampling processing on the fusion features layer by layer through a multilayer feature fusion decoding network, and outputting a semantic segmentation graph.
Specifically, in an encoder-decoder architecture, repeated downsampling in the encoder stage can cause loss of detail information. If the decoding stage directly upsamples only the fusion features extracted by the backbone network at the final stage, the resulting segmentation is inaccurate in boundary regions and may even be wrong there. Based on this, fig. 4 shows a schematic structural diagram of the multi-layer feature fusion decoding network provided in this embodiment: a preset decoding algorithm is adopted by the multi-layer feature fusion decoding network to perform convolution operation and upsampling processing on the fusion features layer by layer and output a semantic segmentation map, the decoding algorithm being expressed as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

where F_{3,2,1} represents the fusion features obtained in the encoding phase (stages 3, 2 and 1); F_NBt1D represents three NBt1D layers, each NBt1D module containing four convolution layers implemented as decomposed one-dimensional convolutions, which have fewer parameters and less computation than ordinary convolution; δ represents the ReLU activation function; F_c represents a 3×3 convolutional layer; G represents the input features of each stage of the decoding network; and G_fusion represents the semantic segmentation map output by each stage, which is restored to the original image size by upsampling.
Therefore, this embodiment introduces the fusion features of the encoding stage into the decoding stage and adds them into the decoder layer by layer through the NBt1D modules; because the NBt1D module is light-weight, a high-resolution and accurate semantic segmentation map can be constructed without introducing a large number of parameters.
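A minimal sketch of one decoding stage under the equation above follows, with the NBt1D block written in the ERFNet-style "non-bottleneck-1D" form (four factorized 3×1 / 1×3 convolution layers); treating NBt1D this way is an assumption based on the module name and the "four convolution layers" description, the per-stage bilinear upsampling factor is likewise assumed, and the final classification layer that maps to class logits is omitted.

```python
# Minimal sketch of G_fusion = F_{3,2,1} + F_NBt1D(ReLU(F_c(G))) for one decoder stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NBt1D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):
        return torch.relu(x + self.conv(x))   # light-weight residual refinement

class DecoderStage(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)   # F_c
        self.refine = nn.Sequential(NBt1D(channels), NBt1D(channels), NBt1D(channels))  # F_NBt1D

    def forward(self, g, skip_fusion):
        # g: decoder input of this stage; skip_fusion: fusion feature from the matching encoder stage
        # (channel and spatial sizes are assumed to match for this sketch)
        out = skip_fusion + self.refine(torch.relu(self.conv(g)))
        return F.interpolate(out, scale_factor=2, mode='bilinear', align_corners=False)
```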
According to the attention-guided multi-modal feature fusion image semantic segmentation method provided by this embodiment of the invention, features are first extracted from the RGB image and the depth image by the feature extraction network, and attention is used to control the feature refinement. In addition, in order to better fuse the two sets of features, the invention exploits their complementarity to realize cross-complementary aggregation of features in the spatial dimension, finally realizing cross-modal feature fusion. Because repeated downsampling in the encoding stage may cause loss of detail information, the upsampling in the decoding stage introduces the fusion features with stronger representation capability obtained in the encoding stage. Experiments on the RGB-D semantic segmentation datasets NYUv2 and SUN RGB-D show that this method outperforms most existing methods on the RGB-D semantic segmentation task, with a good segmentation effect and a small number of parameters.
Second embodiment:
in order to solve the technical problems of poor robustness, low segmentation accuracy and the like of an RGB-D image semantic segmentation model provided in the related art, this embodiment shows an image semantic segmentation apparatus, which is applied to an overall neural network including a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, and specifically refer to fig. 5, the image semantic segmentation apparatus of this embodiment includes:
an extraction module 501, configured to perform feature extraction processing on the color image and the corresponding depth image through a feature extraction network, respectively, to obtain a color image feature and a depth image feature;
an alignment module 502, configured to connect the color image features and the depth image features along the channel dimension through a multi-modal feature alignment network, perform convolution operation to obtain mixed features, and align the color image features and the depth image features in the channel dimension and the spatial dimension based on the mixed features to obtain aligned color image features and aligned depth image features;
the fusion module 503 is configured to obtain a first weight matrix of importance degrees of each position point of the aligned color image feature and a second weight matrix of importance degrees of each position point of the aligned depth image feature through a cross-modal feature fusion network, and then fuse the first weight matrix with the aligned color image feature and fuse the second weight matrix with the aligned depth image feature and then perform superposition processing to obtain a fusion feature;
and the decoding module 504 is configured to perform convolution operation and upsampling processing on the fusion features layer by layer through a multi-layer feature fusion decoding network, and output a semantic segmentation map.
In an optional implementation manner of this embodiment, the alignment module is specifically configured to: performing global average pooling on the mixed features in the horizontal direction and the vertical direction respectively to obtain a feature vector in the horizontal direction and a feature vector in the vertical direction respectively; connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along the channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and coded intermediate feature map; and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
Further, in an optional implementation manner of this embodiment, the alignment module is specifically configured to: performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X + X · σ(F_c^Xh(f^h)) · σ(F_c^Xw(f^w))

D_rep = D + D · σ(F_c^Dh(f^h)) · σ(F_c^Dw(f^w))

where X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color-map features and depth-map features obtained by the feature extraction network, σ represents the sigmoid activation function, f^h and f^w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension, F_c^Xh, F_c^Xw, F_c^Dh and F_c^Dw represent convolution operations, and X_rep ∈ R^(C×H×W), D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color-map features and aligned depth-map features.
In an optional implementation manner of this embodiment, the fusion module is specifically configured to: respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic by adopting a preset difference distribution learning algorithm through a cross-modal characteristic fusion network;
the difference distribution learning algorithm is expressed as:
G_X' ∥ G_D' = δ(F_c^X(X_rep) ∥ F_c^D(D_rep))

where X_rep and D_rep respectively represent the aligned color-map features and aligned depth-map features, F_c^X and F_c^D represent convolution operations that compress the respective input feature to one channel, ∥ represents connection along the channel dimension, δ represents the softmax activation function, and G_X' ∈ R^(1×H×W), G_D' ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
In an optional implementation manner of this embodiment, the decoding module is specifically configured to: performing convolution operation and up-sampling processing on the fusion features layer by adopting a preset decoding algorithm through a multilayer feature fusion decoding network, and outputting a semantic segmentation graph;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

where F_{3,2,1} represents the fusion features of the encoding phase, F_NBt1D represents three NBt1D layers, each NBt1D module containing four convolution layers, δ represents the ReLU activation function, F_c represents a 3×3 convolutional layer, G represents the input features of each stage, and G_fusion represents the semantic segmentation map.
It should be noted that, the image semantic segmentation method in the foregoing embodiment can be implemented based on the image semantic segmentation device provided in this embodiment, and it can be clearly understood by those skilled in the art that, for convenience and simplicity of description, a specific working process of the image semantic segmentation device described in this embodiment may refer to a corresponding process in the foregoing method embodiment, and details are not described here.
The image semantic segmentation device for guiding multi-modal feature fusion by adopting the attention provided by the embodiment is used for mixing the extracted color image features and the extracted depth image features; refining and adding the mixed features into the input features in two dimensions of a channel and a space, eliminating noise existing in a depth map, and aligning the two parts of features in a self-adaptive manner; in order to further complementarily fuse the two part features, the complementary fusion of the multi-modal features is realized by acquiring the importance degrees of the corresponding positions of the two parts of features and learning the complementary relation between the color image and the depth image in a self-adaptive manner; in order to introduce important spatial detail information in a decoding stage, a multi-layer feature fusion method is adopted to introduce fusion features in a coding stage, more detail information is added, more information is focused on a boundary region during segmentation, fine segmentation of the boundary region is realized, and a more accurate and efficient semantic segmentation map is generated. Therefore, the robustness and the segmentation precision of the RGB-D image semantic segmentation model are effectively improved.
The third embodiment:
the present embodiment provides an electronic device, as shown in fig. 6, which includes a processor 601, a memory 602, and a communication bus 603, wherein: the communication bus 603 is used for realizing connection communication between the processor 601 and the memory 602; the processor 601 is configured to execute one or more computer programs stored in the memory 602 to implement at least one step of the image semantic segmentation method in the first embodiment.
The present embodiments also provide a computer-readable storage medium including volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other Memory technology, CD-ROM (Compact disk Read-Only Memory), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
The computer-readable storage medium in this embodiment may be used for storing one or more computer programs, and the stored one or more computer programs may be executed by a processor to implement at least one step of the method in the first embodiment.
The present embodiment also provides a computer program, which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the method in the first embodiment; and in some cases at least one of the steps shown or described may be performed in an order different than that described in the embodiments above.
The present embodiments also provide a computer program product comprising a computer readable means on which a computer program as shown above is stored. The computer readable means in this embodiment may include a computer readable storage medium as shown above.
It will be apparent to those skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software (which may be implemented in computer program code executable by a computing device), firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
In addition, communication media typically embodies computer readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to one of ordinary skill in the art. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a more detailed description of embodiments of the present invention, and the present invention is not to be considered limited to such descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. An attention-guided multi-modal feature fusion image semantic segmentation method is applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, and is characterized by comprising the following steps of:
respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
connecting the color image features and the depth image features along a channel dimension through the multi-mode feature alignment network, performing convolution operation to obtain mixed features, aligning the color image features and the depth image features on the channel dimension and the space dimension based on the mixed features, and obtaining the aligned color image features and the aligned depth image features;
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image characteristic and a second weight matrix of the importance degree of each position point of the aligned depth image characteristic through the cross-modal characteristic fusion network, and then fusing the first weight matrix with the aligned color image characteristic and fusing the second weight matrix with the aligned depth image characteristic and then performing superposition processing to obtain a fusion characteristic;
and performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network, and outputting a semantic segmentation graph.
2. The image semantic segmentation method according to claim 1, wherein the step of aligning the color map features and the depth map features in a channel dimension and a space dimension based on the mixture features to obtain aligned color map features and aligned depth map features comprises:
respectively carrying out global average pooling on the mixed features in the horizontal direction and the vertical direction to respectively obtain a feature vector in the horizontal direction and a feature vector in the vertical direction;
connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along a channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map;
and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
3. The image semantic segmentation method according to claim 2, wherein the step of performing feature alignment based on the compressed and encoded intermediate feature map to obtain an aligned color map feature and an aligned depth map feature comprises:
performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X + X · σ(F_c^Xh(f^h)) · σ(F_c^Xw(f^w))

D_rep = D + D · σ(F_c^Dh(f^h)) · σ(F_c^Dw(f^w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color-map features and depth-map features obtained by the feature extraction network, σ represents the sigmoid activation function, f^h and f^w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension, F_c^Xh, F_c^Xw, F_c^Dh and F_c^Dw represent convolution operations, and X_rep ∈ R^(C×H×W), D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color-map features and aligned depth-map features.
4. The image semantic segmentation method according to claim 1, wherein the step of obtaining a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network respectively comprises:
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features by adopting a preset difference distribution learning algorithm through the cross-modal feature fusion network;
the difference distribution learning algorithm is expressed as:
G_X' ∥ G_D' = δ(F_c^X(X_rep) ∥ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color-map features and aligned depth-map features, F_c^X and F_c^D represent convolution operations that compress the respective input feature to one channel, ∥ represents connection along the channel dimension, δ represents the softmax activation function, and G_X' ∈ R^(1×H×W), G_D' ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
5. The image semantic segmentation method according to any one of claims 1 to 4, wherein the step of performing convolution operation and upsampling processing on the fused features layer by layer through the multi-layer feature fusion decoding network to output a semantic segmentation map comprises:
performing convolution operation and up-sampling processing on the fusion features layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and outputting a semantic segmentation graph;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents the fusion features of the encoding phase, F_NBt1D represents three NBt1D layers, each of the NBt1D modules containing four convolution layers, δ represents the ReLU activation function, F_c represents a 3×3 convolutional layer, G represents the input features of each stage, and G_fusion represents the semantic segmentation map.
6. An attention-guided multi-modal feature fusion image semantic segmentation device is applied to an overall neural network comprising a feature extraction network, a multi-modal feature alignment network, a cross-modal feature fusion network and a multi-layer feature fusion decoding network, and is characterized by comprising:
the extraction module is used for respectively carrying out feature extraction processing on the color image and the corresponding depth image through the feature extraction network to obtain color image features and depth image features;
the alignment module is used for performing convolution operation to obtain mixed features after the color image features and the depth image features are connected along channel dimensions through the multi-mode feature alignment network, and aligning the color image features and the depth image features on the channel dimensions and the space dimensions based on the mixed features to obtain the aligned color image features and the aligned depth image features;
the fusion module is used for respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image feature and a second weight matrix of the importance degree of each position point of the aligned depth image feature through the cross-modal feature fusion network, then fusing the first weight matrix and the aligned color image feature, and fusing the second weight matrix and the aligned depth image feature and then performing superposition processing to obtain a fusion feature;
and the decoding module is used for performing convolution operation and up-sampling processing on the fusion features layer by layer through the multilayer feature fusion decoding network and outputting a semantic segmentation graph.
7. The image semantic segmentation apparatus according to claim 6, wherein the alignment module is specifically configured to:
respectively carrying out global average pooling on the mixed features in the horizontal direction and the vertical direction to respectively obtain a feature vector in the horizontal direction and a feature vector in the vertical direction;
connecting the feature vectors in the horizontal direction and the feature vectors in the vertical direction along a channel dimension, and compressing by adopting a nonlinear activation function to obtain a compressed and encoded intermediate feature map;
and performing feature alignment based on the compressed and coded intermediate feature map to obtain aligned color map features and aligned depth map features.
8. The image semantic segmentation apparatus of claim 7, wherein the alignment module is further configured to:
performing feature alignment on the compressed and coded intermediate feature map by adopting a preset alignment repair algorithm to obtain aligned color map features and aligned depth map features;
the alignment repair algorithm is represented as:
X_rep = X + X · σ(F_c^Xh(f^h)) · σ(F_c^Xw(f^w))

D_rep = D + D · σ(F_c^Dh(f^h)) · σ(F_c^Dw(f^w))

wherein X ∈ R^(C×H×W) and D ∈ R^(C×H×W) respectively represent the color-map features and depth-map features obtained by the feature extraction network, σ represents the sigmoid activation function, f^h and f^w represent the two independent feature maps into which the compressed and encoded intermediate feature map is divided along the channel dimension, F_c^Xh, F_c^Xw, F_c^Dh and F_c^Dw represent convolution operations, and X_rep ∈ R^(C×H×W), D_rep ∈ R^(C×H×W) respectively represent the finally output aligned color-map features and aligned depth-map features.
9. The image semantic segmentation apparatus according to claim 6, wherein the fusion module is specifically configured to:
respectively acquiring a first weight matrix of the importance degree of each position point of the aligned color image features and a second weight matrix of the importance degree of each position point of the aligned depth image features by adopting a preset difference distribution learning algorithm through the cross-modal feature fusion network;
the difference distribution learning algorithm is expressed as:
G_X' ∥ G_D' = δ(F_c^X(X_rep) ∥ F_c^D(D_rep))

wherein X_rep and D_rep respectively represent the aligned color-map features and aligned depth-map features, F_c^X and F_c^D represent convolution operations that compress the respective input feature to one channel, ∥ represents connection along the channel dimension, δ represents the softmax activation function, and G_X' ∈ R^(1×H×W), G_D' ∈ R^(1×H×W) respectively represent the first weight matrix and the second weight matrix.
10. The image semantic segmentation apparatus according to any one of claims 6 to 9, wherein the decoding module is specifically configured to:
performing convolution operation and up-sampling processing on the fusion features layer by adopting a preset decoding algorithm through the multilayer feature fusion decoding network, and outputting a semantic segmentation graph;
the decoding algorithm is represented as:
G_fusion = F_{3,2,1} + F_NBt1D(δ(F_c(G)))

wherein F_{3,2,1} represents the fusion features of the encoding phase, F_NBt1D represents three NBt1D layers, each of the NBt1D modules containing four convolution layers, δ represents the ReLU activation function, F_c represents a 3×3 convolutional layer, G represents the input features of each stage, and G_fusion represents the semantic segmentation map.
CN202111658857.9A 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion Active CN114372986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111658857.9A CN114372986B (en) 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111658857.9A CN114372986B (en) 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion

Publications (2)

Publication Number Publication Date
CN114372986A true CN114372986A (en) 2022-04-19
CN114372986B CN114372986B (en) 2024-05-24

Family

ID=81141205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111658857.9A Active CN114372986B (en) 2021-12-30 2021-12-30 Image semantic segmentation method and device for attention-guided multi-modal feature fusion

Country Status (1)

Country Link
CN (1) CN114372986B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298361A (en) * 2019-05-22 2019-10-01 浙江省北大信息技术高等研究院 A kind of semantic segmentation method and system of RGB-D image
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN110929696A (en) * 2019-12-16 2020-03-27 中国矿业大学 Remote sensing image semantic segmentation method based on multi-mode attention and self-adaptive fusion
CN112634296A (en) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN113205520A (en) * 2021-04-22 2021-08-03 华中科技大学 Method and system for semantic segmentation of image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张娣; 陆建峰: "Semantic segmentation model based on binocular images and cross-level feature guidance" (基于双目图像与跨级特征引导的语义分割模型), Computer Engineering (计算机工程), no. 10, 15 October 2020 (2020-10-15) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842312A (en) * 2022-05-09 2022-08-02 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN115496975A (en) * 2022-08-29 2022-12-20 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN115496975B (en) * 2022-08-29 2023-08-18 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN116109645A (en) * 2023-04-14 2023-05-12 锋睿领创(珠海)科技有限公司 Intelligent processing method, device, equipment and medium based on priori knowledge
CN116935052A (en) * 2023-07-24 2023-10-24 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment
CN116935052B (en) * 2023-07-24 2024-03-01 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment
CN116978011A (en) * 2023-08-23 2023-10-31 广州新华学院 Image semantic communication method and system for intelligent target recognition
CN116978011B (en) * 2023-08-23 2024-03-15 广州新华学院 Image semantic communication method and system for intelligent target recognition
CN117014633A (en) * 2023-10-07 2023-11-07 深圳大学 Cross-modal data compression method, device, equipment and medium
CN117014633B (en) * 2023-10-07 2024-04-05 深圳大学 Cross-modal data compression method, device, equipment and medium

Also Published As

Publication number Publication date
CN114372986B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN114372986A (en) Attention-guided multi-modal feature fusion image semantic segmentation method and device
Guo et al. Dense scene information estimation network for dehazing
CN111402146B (en) Image processing method and image processing apparatus
WO2021164731A1 (en) Image enhancement method and image enhancement apparatus
KR20210031427A (en) Methods, devices, computer devices and media for recognizing traffic images
CN111353948B (en) Image noise reduction method, device and equipment
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
WO2018168539A1 (en) Learning method and program
US20220270215A1 (en) Method for applying bokeh effect to video image and recording medium
CN111145290A (en) Image colorization method, system and computer readable storage medium
CN114038006A (en) Matting network training method and matting method
US20240112404A1 (en) Image modification techniques
Yuan et al. Multiview scene image inpainting based on conditional generative adversarial networks
CN116563459A (en) Text-driven immersive open scene neural rendering and mixing enhancement method
Liang et al. Learning to remove sandstorm for image enhancement
CN113487530A (en) Infrared and visible light fusion imaging method based on deep learning
CN113436107A (en) Image enhancement method, intelligent device and computer storage medium
CN116993987A (en) Image semantic segmentation method and system based on lightweight neural network model
CN117456330A (en) MSFAF-Net-based low-illumination target detection method
CN117078574A (en) Image rain removing method and device
CN115965531A (en) Model training method, image generation method, device, equipment and storage medium
CN116309215A (en) Image fusion method based on double decoders
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene
CN115423697A (en) Image restoration method, terminal and computer storage medium
CN115205487A (en) Monocular camera face reconstruction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant