CN116091765A - RGB-T image semantic segmentation method and device - Google Patents
- Publication number: CN116091765A
- Application number: CN202211715697.1A
- Authority
- CN
- China
- Prior art keywords: rgb, image, semantic segmentation, fusion, feature
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/40—Analysis of texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/778—Active pattern-learning, e.g. online learning of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10048—Infrared image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention provides an RGB-T image semantic segmentation method and device. An RGB-T image semantic segmentation model is trained in advance on a semi-labeled RGB-T image-pair data set using spatial cross-modal information fusion, multi-scale feature iterative fusion, and RGB-image random-mask data enhancement. This improves the model's ability to mine cross-modal spatial complementary information and its segmentation performance under poor illumination, and reduces the labeling cost. In the application stage, the model generates the semantic segmentation image of a target RGB-T image pair, improving the accuracy of the semantic segmentation result.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an RGB-T image semantic segmentation method and device.
Background
Semantic segmentation aims to assign a class label to each pixel of an RGB image. It is one of the key technologies for scene perception and plays a vital role in computer vision tasks such as autonomous driving, pedestrian detection, and remote sensing image analysis.
Under poor illumination (too low brightness or overexposure), texture information may be missing in parts of an RGB image, and direct semantic segmentation of such an image may yield unreliable results. RGB-T semantic segmentation has therefore emerged, in which thermal infrared (T) images supplement the missing texture information of RGB images. Existing RGB-T semantic segmentation mostly fuses RGB features and thermal infrared features either by additive fusion after modal feature self-enhancement or by channel-dimension fusion after modal feature alignment, and completes image semantic segmentation with the fused features.
However, neither additive fusion after modal feature self-enhancement nor channel-dimension fusion after modal feature alignment makes full use of the spatial complementarity between the modal features, so existing RGB-T semantic segmentation performance is poor.
Disclosure of Invention
The invention provides an RGB-T image semantic segmentation method and device to solve the poor segmentation performance caused in the prior art by insufficient use of the spatial complementarity between RGB features and thermal infrared features. An RGB-T image semantic segmentation model is trained on a semi-labeled RGB-T image-pair data set using spatial cross-modal information fusion, multi-scale feature iterative fusion, and RGB-image random-mask data enhancement, strengthening the model's ability to mine cross-modal spatial complementary information. The resulting model features low labeling cost and high semantic segmentation performance under poor illumination, and accurate semantic segmentation can then be realized with it.
In a first aspect, the present invention provides a method for semantic segmentation of RGB-T images, the method comprising:
calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the degree of missing texture information in the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
In a second aspect, the present invention provides an RGB-T image semantic segmentation apparatus, the apparatus comprising:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
the selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the degree of missing texture information in the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The invention provides an RGB-T image semantic segmentation method and device. A dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch is trained in advance on an RGB-T image semantic segmentation data set to obtain an RGB-T image semantic segmentation model. (1) The dual-branch network adaptively and complementarily fuses the RGB modal features and thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that it can deeply mine the cross-modal spatial complementary texture features of the RGB-T image pair. (2) The dual-branch design enables the model to better cope with the loss of texture signal in a single modality. (3) The RGB-T image semantic segmentation data set is obtained by enhancing a semi-labeled RGB-T image-pair data set with RGB-modality random masking, which introduces new inter-modality spatial-information complementary regions and makes full use of the labeled data when training the model. The resulting model therefore better exploits cross-modal spatial complementary information; compared with existing RGB-T semantic segmentation technology, it achieves better segmentation performance under poor illumination at lower labeling cost, facilitating low-cost, fine-grained perception of complex environments.
A target RGB-T image pair is input into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch. According to the degree of missing texture information in the target RGB-T image pair, one of the first and second semantic segmentation images is selected as the semantic segmentation image of the target RGB-T image pair, making the semantic segmentation result more accurate.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an RGB-T image semantic segmentation method provided by the invention;
FIG. 2 is a schematic diagram of a dual-branch RGB-T semantic segmentation network provided by the present invention;
FIG. 3 is a diagram of an example fully supervised learning provided by the present invention;
FIG. 4 is a diagram of cross-modal mutual learning examples provided by the present invention;
FIG. 5 is an RGB-T feature flow diagram provided by the present invention;
FIG. 6 is a schematic diagram of a spatial cross-modal information fusion module provided by the invention;
FIG. 7 is a schematic structural diagram of a multi-scale feature iterative fusion module provided by the present invention;
FIG. 8 is a schematic flow chart of an RGB-T image semantic segmentation device provided by the invention;
fig. 9 is a schematic structural diagram of an electronic device provided by the present invention;
reference numerals:
910: a processor; 920: a communication interface; 930: a memory; 940: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First, abbreviations and key term definitions in the art will be explained:
- mIoU: mean Intersection over Union, average intersection-over-union ratio
- SCF: Spatial-wise Cross-modal Fusion, spatial cross-modal information fusion
- RMM: Repetitive Multi-scale fusion Module, multi-scale feature iterative fusion
- M-CutOut: Mono-modal CutOut, single-modality random mask data enhancement
- Conv: Convolution operation
- CD: Channel-wise Denoise, channel adaptive noise reduction
- ADM: Attentive Demand Map, spatially adaptive demand map evaluation
- CF: Cross-modal Fusion
- ASPP: Atrous Spatial Pyramid Pooling, spatial pyramid pooling
- SF: Spatial-wise Fusion, spatial dimension fusion
- CA: Channel-wise Attention, channel attention mechanism
- SA: Spatial-wise Attention, spatial attention mechanism
- BN: Batch Normalization operation
- MLP: Multilayer Perceptron
The following describes a semantic segmentation method and device for RGB-T images according to the present invention with reference to FIGS. 1 to 9.
In a first aspect, the present invention provides a method for semantic segmentation of RGB-T images, as shown in fig. 1, including:
s11, calling an RGB-T image semantic segmentation model;
specifically, the RGB-T image semantic segmentation model is pre-constructed, and the construction process comprises the following steps:
constructing an RGB-T image semantic segmentation data set; wherein, the RGB-T image semantic segmentation dataset can be acquired as follows:
step A: collecting RGB-T data pairs to form a data set T;
alternatively, the data set T may use open-source data, such as the road-scene MFNet data set and the underground-scene PST900 data set.
And (B) step (B): performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set T to obtain a first data set;
the invention can be directly applied to semi-supervision tasks (namely, part of RGB-T image pairs are used as labeled image pairs and the other part of RGB-T image pairs are used as unlabeled image pairs) through a cross-modal mutual learning method because the pixel-level semantic segmentation labeling cost of the RGB-T image pairs is very high, so that the effect of reducing the labeling cost of an RGB-T image semantic segmentation model is achieved.
Step C: performing data enhancement on the first data set by adopting an RGB image random mask mode to obtain an RGB-T image semantic segmentation data set;
To make full use of the labeled image pairs, the invention proposes a single-modality random-mask data enhancement method, M-CutOut, which applies a random mask operation to the RGB image of each RGB-T image pair in the first data set to artificially introduce new spatial complementary information regions, prompting the RGB-T image semantic segmentation model to better exploit cross-modal spatial complementary texture information during training.
Further, performing a random masking operation on the RGB image in the RGB-T image pair, including:
first initialize a mask M of all ones; then, at a random position, select a rectangular area proportional to the image scale and set that area of M to 0; finally, multiply the mask with the original RGB image pixel by pixel to create a new region of missing visible-light texture information.
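The mask construction above can be sketched as follows (a minimal NumPy sketch; the side-length fraction `scale` is an illustrative assumption, as the patent only states that the rectangle is proportional to the image scale):

```python
import numpy as np

def m_cutout(rgb, scale=0.3, rng=None):
    """M-CutOut sketch: mask a random rectangle of the RGB image only,
    leaving the paired thermal image untouched. rgb has shape (H, W, 3)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = rgb.shape[:2]
    mh, mw = int(h * scale), int(w * scale)      # rectangle proportional to image scale
    top = rng.integers(0, h - mh + 1)
    left = rng.integers(0, w - mw + 1)
    mask = np.ones((h, w), dtype=rgb.dtype)      # initialize mask M of all ones
    mask[top:top + mh, left:left + mw] = 0       # zero the random rectangle
    return rgb * mask[..., None], mask           # pixel-wise multiply with the RGB image
```

Only the RGB modality is masked, so the untouched thermal image provides exactly the complementary information the masked region lacks.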
Constructing a double-branch RGB-T semantic segmentation network; wherein, RGB branch and hot infrared branch in the two branch RGB-T semantic segmentation network all adopt encoding-decoding structure, wherein:
the RGB branch comprises a first input layer, a first feature encoder, a first feature decoder, a first pixel level classifier and a main prediction output layer;
The thermal infrared branch comprises a second input layer, a second feature encoder, a second feature decoder, a second pixel level classifier and an auxiliary prediction output layer;
the first feature encoder includes K feature extraction layers;
the second feature encoder includes K feature extraction layers;
the first feature encoder and the second feature encoder jointly comprise K space cross-modal information fusion modules;
the first feature decoder comprises a first multi-scale feature iterative fusion module;
the second feature decoder includes a second multi-scale feature iterative fusion module. FIG. 2 provides a schematic diagram of the dual-branch RGB-T semantic segmentation network (the drawings of the present invention are each illustrated with K=5, without loss of generality), in which SCF (Spatial-wise Cross-modal Fusion) denotes the spatial cross-modal information fusion module, RMM (Repetitive Multi-scale fusion Module) denotes the multi-scale feature iterative fusion module, and L_i denotes the feature extraction layer with index i.
As can be seen from FIG. 2, the RGB branch and the thermal infrared branch of the dual-branch RGB-T semantic segmentation network progressively extract features through the K feature extraction layers of the encoding structure and gradually perform spatial cross-modal information fusion through the SCF module embedded after each feature extraction layer. In the decoding structure, the RMM module iteratively fuses the multi-scale fusion features in the spatial dimension to compensate for the information loss caused by spatial downsampling in feature encoding; scale transformation in feature decoding is realized by an upsampling method (such as bilinear upsampling). Together, spatial cross-modal information fusion and multi-scale feature iterative fusion give the dual-branch network the ability to deeply mine the cross-modal spatial complementary texture features of RGB-T image pairs.
Constructing a loss function: when training the RGB-T image semantic segmentation model, the invention applies cross-modal mutual learning directly to the semi-supervised semantic segmentation task. Let D_l denote the set of RGB-T image pairs with pixel-level semantic segmentation labels and D_u the set of RGB-T image pairs without pixel-level semantic segmentation labels; the fully supervised learning mode shown in FIG. 3 is adopted for D_l, and the cross-modal mutual learning mode shown in FIG. 4 is adopted for D_u.
Here, L_l is the total training loss for the RGB-T image pairs with pixel-level semantic segmentation labels and L_u is the total training loss for the RGB-T image pairs without them; h_rgb(·) is the RGB branch of the RGB-T image semantic segmentation model, h_the(·) is the thermal infrared branch, and CE(·) denotes the cross-entropy loss function. Y_rgb and Y_the are the pseudo labels corresponding to the RGB branch and the thermal infrared branch, respectively, and M is the random mask pattern of the M-CutOut enhancement; each branch's per-pixel semantic segmentation prediction on the M-CutOut-enhanced data is supervised by the other branch's pseudo label. The pseudo labels Y_rgb and Y_the are generated from semantic segmentation predictions on conventionally weakly augmented data (such as flipping).
Training an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set and a loss function.
The RGB-T image semantic segmentation model trained by the invention has low labeling cost and high semantic segmentation performance under the condition of poor illumination.
Three existing schemes are compared: scheme 1 (patent CN112991350A), scheme 2 (patent CN113362349A), and scheme 3 (patent CN113781504A). All three build their models with fully supervised training, while the present invention, through the cross-modal mutual learning method, trains the model directly on a semi-supervised task (only half of the MFNet data set is used as labeled data; the other half is unlabeled).
Table 1 is a model test effect comparison table on the road scene RGB-T image dataset MFNet;
table 2 is a semi-supervised semantic segmentation effect contrast table on the road scene RGB-T image dataset MFNet.
Tables 1 and 2 show that, under the same experimental settings, the proposed method achieves a better semantic segmentation effect, and that with only half of the labeled data it matches the effect the prior-art schemes achieve using all of the labeled data.
TABLE 1
TABLE 2
S12, inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
specifically, the K feature extraction layers included in the first feature encoder are respectively denoted L_rgb,0 to L_rgb,K−1;
the K feature extraction layers included in the second feature encoder are respectively denoted L_the,0 to L_the,K−1;
the K spatial cross-modal information fusion modules jointly included by the first feature encoder and the second feature encoder are respectively denoted SCF_0 to SCF_K−1.
The S12 includes:
transmitting an RGB image of the target RGB-T image pair to the first feature encoder through the first input layer, and transmitting a thermal infrared image of the target RGB-T image pair to the second feature encoder through the second input layer;
in the combined structure of the first feature encoder and the second feature encoder, using L_rgb,i to perform feature extraction on the input RGB information to obtain f_rgb,i, using L_the,i to perform feature extraction on the input thermal infrared information to obtain f_the,i, and using SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f′_rgb,i and f′_the,i, where i ∈ [0, K−1]; when i = 0, the RGB information input to L_rgb,i is the RGB image of the target RGB-T image pair and the thermal infrared information input to L_the,i is the thermal infrared image of the target RGB-T image pair; when i ∈ [1, K−1], the RGB information input to L_rgb,i is the f′_rgb,i−1 obtained by SCF_i−1 and the thermal infrared information input to L_the,i is the f′_the,i−1 obtained by SCF_i−1;
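The interleaved encoder flow described in this step can be sketched as follows (a structural sketch only; `layers_rgb`, `layers_the` and `scf_modules` are placeholder callables standing in for the patent's feature extraction layers and SCF modules):

```python
import numpy as np

def encode(rgb, the, layers_rgb, layers_the, scf_modules):
    """Dual-branch interleaved encoding (sketch): at scale i, each branch's
    feature extraction layer runs on the previous SCF outputs, then SCF_i
    fuses the two modal features before the next layer."""
    x_rgb, x_the = rgb, the                     # scale 0 inputs are the raw images
    fused = []                                  # (f'_rgb,i, f'_the,i) at each scale
    for l_rgb, l_the, scf in zip(layers_rgb, layers_the, scf_modules):
        f_rgb = l_rgb(x_rgb)                    # f_rgb,i
        f_the = l_the(x_the)                    # f_the,i
        x_rgb, x_the = scf(f_rgb, f_the)        # spatial cross-modal information fusion
        fused.append((x_rgb, x_the))
    return fused
```

The fused features at every scale are kept, since the decoder's multi-scale iterative fusion later consumes them.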
using the first multi-scale feature iterative fusion module to perform spatial-dimension iterative fusion on f′_rgb,K−1, f_m,K−2, f_m,K−3 and f_m,K−4 to obtain the decoding feature d_rgb, and using the second multi-scale feature iterative fusion module to perform spatial-dimension iterative fusion on f′_the,K−1, f_m,K−2, f_m,K−3 and f_m,K−4 to obtain the decoding feature d_the, where f_m,K−j (j = 2, 3, 4) is the feature obtained by additively fusing f′_rgb,K−j and f′_the,K−j;
additively fusing the decoding features d_rgb and d_the output by the first and second multi-scale feature iterative fusion modules to obtain a first additive fusion feature, processing the first additive fusion feature with the first pixel-level classifier to obtain the first semantic segmentation image y_rgb, and outputting y_rgb through the main prediction output layer; processing the decoding feature d_the with the second pixel-level classifier to obtain the second semantic segmentation image y_the, and outputting y_the through the auxiliary prediction output layer;
Further, FIG. 5 shows the RGB-T feature flow of the feature encoding and decoding stages, where CD (Channel-wise Denoise) is the channel adaptive noise reducer, ADM (Attentive Demand Map) is the spatially adaptive demand map evaluator, CF (Cross-modal Fusion) is the cross-modal fuser, ASPP (Atrous Spatial Pyramid Pooling) is the spatial pyramid pooler, and SF (Spatial-wise Fusion) is the spatial dimension fuser. CD, ADM and CF together form the SCF module; ASPP and SF form the RMM module.
Specifically, FIG. 6 provides a schematic structural diagram of the spatial cross-modal information fusion module. In a real scene, both the RGB image and the thermal infrared image are inevitably disturbed by noise from the complex environment, such as strong light irradiation or temperature fluctuation caused by abnormal heat sources. In view of such noise, the proposed SCF module applies two attention mechanisms: the channel adaptive noise reducer CD uses the channel attention mechanism CA (Channel-wise Attention) to determine which channels are more reliable, and the spatially adaptive demand map evaluator ADM uses the spatial attention mechanism SA (Spatial-wise Attention) to determine which regions have a greater need for spatial complementary information fusion. There are many engineering implementations of these two attention mechanisms, such as those in CBAM (S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. CBAM: Convolutional block attention module. In ECCV, pages 3-19, 2018), which those skilled in the art can readily understand, extend and implement.
Thus, using SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f′_rgb,i and f′_the,i includes:
in the channel adaptive noise reducer, performing max pooling and mean pooling on f_rgb,i to obtain a first max-pooled feature MaxPool(f_rgb,i) and a first mean-pooled feature MeanPool(f_rgb,i); generating a first channel attention map A_rgb,i = Sigmoid(MLP(MaxPool(f_rgb,i)) + MLP(MeanPool(f_rgb,i))) with the channel attention mechanism; and taking the product of A_rgb,i and f_rgb,i as the noise-reduced feature of f_rgb,i, denoted f̂_rgb,i;
at the same time, performing max pooling and mean pooling on f_the,i to obtain a second max-pooled feature MaxPool(f_the,i) and a second mean-pooled feature MeanPool(f_the,i); generating a second channel attention map A_the,i = Sigmoid(MLP(MaxPool(f_the,i)) + MLP(MeanPool(f_the,i))) with the channel attention mechanism; and taking the product of A_the,i and f_the,i as the noise-reduced feature of f_the,i, denoted f̂_the,i;
where MLP is the multilayer perceptron and Sigmoid(·) is the Sigmoid function.
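The channel adaptive noise reducer can be sketched as follows (a NumPy sketch assuming a CBAM-style shared two-layer MLP with hypothetical weight matrices `w1`, `w2`; the patent specifies only max/mean pooling, an MLP, and a Sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_denoise(f, w1, w2):
    """CD sketch: score max- and mean-pooled channel descriptors with a
    shared MLP (w1: (C, C//r), w2: (C//r, C) are assumed shapes); the
    sigmoid of their sum is the channel attention map, which rescales
    the feature f of shape (C, H, W)."""
    mx = f.max(axis=(1, 2))                      # MaxPool over space -> (C,)
    mn = f.mean(axis=(1, 2))                     # MeanPool over space -> (C,)
    relu = lambda v: np.maximum(v, 0.0)
    mlp = lambda v: relu(v @ w1) @ w2            # shared MLP applied to both descriptors
    att = sigmoid(mlp(mx) + mlp(mn))             # channel attention map
    return f * att[:, None, None]                # noise-reduced feature
```

Unreliable channels receive attention weights near zero and are suppressed before cross-modal fusion.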
in the spatially adaptive demand map evaluator, based on f̂_rgb,i and f̂_the,i, generating a first spatially adaptive demand map D_rgb,i = Sigmoid(Conv([MeanPool(f̂_rgb,i); MaxPool(f̂_rgb,i)])) with the spatial attention mechanism, and at the same time generating a second spatially adaptive demand map D_the,i = Sigmoid(Conv([MeanPool(f̂_the,i); MaxPool(f̂_the,i)])) with the spatial attention mechanism;
where Conv(·) is a convolution with a 7×7 kernel, Sigmoid(·) is the Sigmoid function, and the role of the spatially adaptive demand map is to represent each region's need for spatial complementary information fusion.
in the cross-modal fuser, additively fusing the dot product of D_rgb,i and f̂_the,i with f̂_rgb,i to obtain f′_rgb,i, and additively fusing the dot product of D_the,i and f̂_rgb,i with f̂_the,i to obtain f′_the,i.
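The ADM and CF steps can be sketched as follows (a NumPy sketch on already noise-reduced features; the 7×7 convolution of the spatial attention is replaced by a simple average of the channel-wise mean and max maps, and the pairing of each demand map with its own modality is an assumption of this sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def demand_map(f):
    """ADM sketch: spatial attention over channel-wise mean/max maps of a
    (C, H, W) feature; a plain average stands in for the 7x7 convolution."""
    return sigmoid((f.mean(axis=0) + f.max(axis=0)) / 2.0)

def cross_modal_fuse(fd_rgb, fd_the):
    """CF sketch: each branch additively absorbs the other modality's
    noise-reduced features where its own demand map is high."""
    a_rgb = demand_map(fd_rgb)                  # where RGB needs complementary info
    a_the = demand_map(fd_the)                  # where thermal needs complementary info
    out_rgb = fd_rgb + a_rgb[None] * fd_the
    out_the = fd_the + a_the[None] * fd_rgb
    return out_rgb, out_the
```

A region where the RGB feature carries little signal (e.g., masked or underexposed) yields a high demand value there, so the thermal feature is injected exactly where it is needed.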
Fig. 7 provides a schematic structural diagram of the multi-scale feature iterative fusion module, in which each Conv unit is composed of a convolution operation, batch normalization, and ReLU activation. Spatial down-sampling in the feature encoding process inevitably causes information loss, while the pixel-level semantic segmentation task depends on detailed texture features; the proposed RMM module compensates for this information loss by iteratively fusing multi-scale features.
Thus, utilizing the first multi-scale feature iterative fusion module to perform spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_rgb comprises:
The feature encoding result f'_rgb,K-1 is first passed through ASPP to embed more global context, giving ASPP(f'_rgb,K-1); the ASPP adopts dilation rates d = 2, 4, 8 and outputs 256 feature channels. The up-sampled feature of ASPP(f'_rgb,K-1) is then iteratively fused with the fusion features f_m,K-2, f_m,K-3 and f_m,K-4, which contain rich texture information, to obtain d_rgb; in consideration of computational complexity, each fusion feature first undergoes channel dimension reduction through one Conv unit, the result being denoted f̄_m,K-z-1.

Specifically, the fusion proceeds iteratively over z ∈ [1,3]: at each step the current decoding feature is up-sampled by Up(·); an adaptive mask, obtained by spatial attention mechanism evaluation (Sigmoid(·) applied to a response computed from the cascaded MeanPool(·) and MaxPool(·) outputs), indicates where and how much information compensation is needed; and the mask is dot-multiplied with f̄_m,K-z-1 before additive fusion into the next decoding feature.
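The iterative fusion just described can be sketched as follows; the nearest-neighbour up-sampling and the simplified mask computation (omitting the convolution of the Conv unit) are assumptions made for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2(x):
    # nearest-neighbour 2x spatial up-sampling; x: (C, H, W)
    return x.repeat(2, axis=1).repeat(2, axis=2)

def multiscale_iterative_fuse(d, skip_feats):
    # d: coarsest decoding feature (ASPP output), shape (C, H, W)
    # skip_feats: channel-reduced fusion features, ordered coarse -> fine,
    # each with twice the spatial size of the previous stage
    for f in skip_feats:  # z = 1, 2, 3
        u = upsample2(d)
        # adaptive mask: where and how much information compensation is needed
        mask = sigmoid(u.mean(axis=0) + u.max(axis=0))
        d = u + mask[None] * f
    return d

rng = np.random.default_rng(2)
d = rng.normal(size=(16, 4, 4))
skips = [rng.normal(size=(16, 8, 8)),
         rng.normal(size=(16, 16, 16)),
         rng.normal(size=(16, 32, 32))]
decoded = multiscale_iterative_fuse(d, skips)
```

After three iterations the spatial resolution grows by a factor of eight, each step re-injecting texture detail gated by the adaptive mask.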
Likewise, utilizing the second multi-scale feature iterative fusion module to perform spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_the comprises:
The feature encoding result f'_the,K-1 is first passed through ASPP to embed more global context, giving ASPP(f'_the,K-1); the up-sampled feature of ASPP(f'_the,K-1) is then iteratively fused with the fusion features f_m,K-2, f_m,K-3 and f_m,K-4, which contain rich texture information, to obtain d_the.
S13, selecting one of the first semantic segmentation image and the second semantic segmentation image as a semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair.
Specifically, S13 includes:
under the condition that no texture information is missing in an RGB image and a thermal infrared image in the target RGB-T image pair, taking the first semantic segmentation image as a semantic segmentation image of the target RGB-T image pair;
and under the condition that texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, taking the first semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a daytime scene, and taking the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a night scene.
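The selection rule of S13 amounts to a small decision function; the boolean parameter names below are hypothetical labels for the texture-loss and scene conditions described above:

```python
def select_segmentation(seg_rgb, seg_the, texture_missing, is_daytime):
    """Output selection of step S13."""
    if not texture_missing:
        return seg_rgb   # no texture loss in either modality: RGB branch
    # texture loss in some modality: trust RGB by day, thermal by night
    return seg_rgb if is_daytime else seg_the
```

In other words, the thermal-branch output is only preferred when texture is degraded and illumination is poor.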
The dual-branch structure of the invention enables the RGB-T semantic segmentation model to better cope with loss of a single-modality data signal. Table 3 compares the results of signal-loss robustness experiments on the road scene RGB-T image dataset MFNet; the experiments prove that the invention obtains better semantic segmentation results under signal-loss conditions.
Table 3
In summary, according to the RGB-T image semantic segmentation method provided by the invention, an RGB-T image semantic segmentation data set is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network performs adaptive complementary fusion of the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that it can deeply mine the cross-modal spatial complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with loss of the single-modality texture signal. (3) The RGB-T image semantic segmentation data set is obtained by performing data enhancement on the semi-labeled RGB-T image pair data set through RGB-mode random masking, which introduces new inter-modal spatial information complementary regions and makes full use of the labeled data for training the model. Therefore, the obtained RGB-T image semantic segmentation model better exploits cross-modal spatial complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination at a lower labeling cost, which is conducive to cost-effective fine-grained perception of complex environments.
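For illustration, the RGB-mode random mask data enhancement can be sketched as follows; the rectangular mask shape and size range are assumptions, since the text does not fix the mask geometry here:

```python
import numpy as np

def random_mask_rgb(rgb, rng, max_frac=0.3):
    # rgb: (3, H, W); zero out a random rectangle in the RGB image only,
    # so the paired thermal image becomes the sole information source there,
    # creating a new inter-modal spatial information complementary region
    h, w = rgb.shape[1], rgb.shape[2]
    mh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    mw = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    y = int(rng.integers(0, h - mh + 1))
    x = int(rng.integers(0, w - mw + 1))
    out = rgb.copy()
    out[:, y:y + mh, x:x + mw] = 0.0
    return out, (y, x, mh, mw)

rng = np.random.default_rng(3)
img = np.ones((3, 32, 32))
masked, (y, x, mh, mw) = random_mask_rgb(img, rng)
```

Only the RGB half of the pair is masked; the thermal image and the segmentation label are left untouched, which forces the network to recover the masked region from the thermal modality.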
Inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the condition that texture information of the target RGB-T image pair is missing, one of the first semantic segmentation image and the second semantic segmentation image is selected to serve as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair can be more accurate.
In addition, on the basis of the above scheme, similar effects can be achieved by using different feature extraction backbone networks and by modifying the parameters of each module (such as the number of convolutional layers, the number of channels, and the activation function), and similar semi-supervised semantic segmentation training can be achieved with different combinations of strong and weak data enhancement.
In a second aspect, the RGB-T image semantic segmentation apparatus provided by the present invention is described below; the apparatus described below and the RGB-T image semantic segmentation method described above may be referred to correspondingly with each other. Fig. 8 is a schematic diagram of the RGB-T image semantic segmentation apparatus provided by the present invention; as shown in fig. 8, the apparatus includes:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
The selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The invention provides an RGB-T image semantic segmentation device, in which an RGB-T image semantic segmentation data set is used in advance to train a dual-branch RGB-T semantic segmentation network comprising an RGB branch and a thermal infrared branch, obtaining an RGB-T image semantic segmentation model. (1) The dual-branch RGB-T semantic segmentation network performs adaptive complementary fusion of the RGB modal features and the thermal infrared modal features of an input RGB-T image pair through spatial cross-modal information fusion, and compensates for the spatial information loss of the feature extraction stage through multi-scale feature iterative fusion, so that it can deeply mine the cross-modal spatial complementary texture features of the RGB-T image pair. (2) The dual-branch structure enables the RGB-T image semantic segmentation model to better cope with loss of the single-modality texture signal. (3) The RGB-T image semantic segmentation data set is obtained by performing data enhancement on the semi-labeled RGB-T image pair data set through RGB-mode random masking, which introduces new inter-modal spatial information complementary regions and makes full use of the labeled data for training the model. Therefore, the obtained RGB-T image semantic segmentation model better exploits cross-modal spatial complementary information and, compared with existing RGB-T semantic segmentation techniques, achieves better semantic segmentation performance under poor illumination at a lower labeling cost, which is conducive to cost-effective fine-grained perception of complex environments.
Inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; according to the condition that texture information of the target RGB-T image pair is missing, one of the first semantic segmentation image and the second semantic segmentation image is selected to serve as the semantic segmentation image of the target RGB-T image pair, so that the semantic segmentation result of the target RGB-T image pair can be more accurate.
In a third aspect, fig. 9 illustrates a physical schematic diagram of an electronic device. As shown in fig. 9, the electronic device may include: a processor 910, a communication interface (Communications Interface) 920, a memory 930, and a communication bus 940, wherein the processor 910, the communication interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform the RGB-T image semantic segmentation method, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
Further, the logic instructions in the memory 930 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or the part contributing to the prior art or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing a method of RGB-T image semantic segmentation provided by the above methods, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of semantic segmentation of RGB-T images provided by the above methods, the method comprising: calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set; inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch; selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair; the RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs; the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for semantic segmentation of RGB-T images, the method comprising:
calling an RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
The RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
2. The RGB-T image semantic segmentation method of claim 1, wherein the RGB branch comprises a first input layer, a first feature encoder, a first feature decoder, a first pixel-level classifier, and a main prediction output layer;
the thermal infrared branch comprises a second input layer, a second feature encoder, a second feature decoder, a second pixel level classifier and an auxiliary prediction output layer;
wherein the first feature encoder comprises K feature extraction layers, respectively denoted L_rgb,0 to L_rgb,K-1;

the second feature encoder comprises K feature extraction layers, respectively denoted L_the,0 to L_the,K-1;

the first feature encoder and the second feature encoder jointly comprise K spatial cross-modal information fusion modules, respectively denoted SCF_0 to SCF_K-1;
The first feature decoder comprises a first multi-scale feature iterative fusion module, and the second feature decoder comprises a second multi-scale feature iterative fusion module;
inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch, wherein the method comprises the following steps:
transmitting an RGB image of the target RGB-T image pair to the first feature encoder through the first input layer, and transmitting a thermal infrared image of the target RGB-T image pair to the second feature encoder through the second input layer;
in the combined structure of the first feature encoder and the second feature encoder, utilizing L_rgb,i to perform feature extraction on the input RGB information to obtain f_rgb,i, utilizing L_the,i to perform feature extraction on the input thermal infrared information to obtain f_the,i, and utilizing SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f'_rgb,i and f'_the,i; wherein i ∈ [0, (K-1)]; when i = 0, the RGB information input to L_rgb,i is the RGB image of the target RGB-T image pair, and the thermal infrared information input to L_the,i is the thermal infrared image of the target RGB-T image pair; when i ∈ [1, (K-1)], the RGB information input to L_rgb,i is f'_rgb,i-1 obtained by SCF_i-1, and the thermal infrared information input to L_the,i is f'_the,i-1 obtained by SCF_i-1;
Iterative fusion of module pairs f 'using first multi-scale features' rgb,K-1 、f m,k-2 、f m,k-3 and fm,k-4 Performing space dimension iterative fusion to obtain decoding characteristicsIterative fusion of module pairs f 'using second multi-scale features' the,K-1 f m,k-2 、f m,k-3 and fm,k-4 Performing space dimension iterative fusion to obtain decoding characteristics +.>Wherein j is E [2,K ]],f m,k-j Is f rgb,k-j And f the,k-j The characteristics obtained after the additive fusion are carried out; />
additively fusing the decoding features d_rgb and d_the to obtain a first additive fusion feature, processing the first additive fusion feature with the first pixel-level classifier to obtain the first semantic segmentation image y_rgb, and outputting the first semantic segmentation image y_rgb through the main prediction output layer;
3. The RGB-T image semantic segmentation method of claim 2, wherein the spatial cross-modality information fusion module comprises: a channel adaptive noise reducer, a space adaptive demand graph evaluator and a cross-modal fusion device;
utilizing SCF_i to perform spatial cross-modal information fusion on f_rgb,i and f_the,i to obtain f'_rgb,i and f'_the,i comprises:
in the channel adaptive noise reducer, performing maximum pooling and mean pooling on f_rgb,i to obtain a first maximum-pooling feature MaxPool(f_rgb,i) and a first mean-pooling feature MeanPool(f_rgb,i); based on MaxPool(f_rgb,i) and MeanPool(f_rgb,i), generating a first channel attention map A_rgb,i using the channel attention mechanism, and taking the product of A_rgb,i and f_rgb,i as the noise-reduction feature f̂_rgb,i of f_rgb,i;

at the same time, performing maximum pooling and mean pooling on f_the,i to obtain a second maximum-pooling feature MaxPool(f_the,i) and a second mean-pooling feature MeanPool(f_the,i); based on MaxPool(f_the,i) and MeanPool(f_the,i), generating a second channel attention map A_the,i using the channel attention mechanism, and taking the product of A_the,i and f_the,i as the noise-reduction feature f̂_the,i of f_the,i;

in the spatial adaptive demand map evaluator, based on the noise-reduction features f̂_rgb,i and f̂_the,i, generating a first spatial adaptive demand map M_rgb,i using the spatial attention mechanism; at the same time, based on f̂_the,i and f̂_rgb,i, generating a second spatial adaptive demand map M_the,i using the spatial attention mechanism;
6. The RGB-T image semantic segmentation method of claim 2, wherein the first multi-scale feature iterative fusion module and the second multi-scale feature iterative fusion module each comprise: a spatial pyramid pooler and a spatial dimension fusion device;
utilizing the first multi-scale feature iterative fusion module to perform spatial-dimension iterative fusion on f'_rgb,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_rgb comprises:

generating, by the spatial pyramid pooler in the first multi-scale feature iterative fusion module, the global feature corresponding to f'_rgb,K-1;

in the spatial dimension fusion device in the first multi-scale feature iterative fusion module, performing spatial-dimension iterative fusion on the up-sampled feature of f'_rgb,K-1 embedded with the global feature, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_rgb;
utilizing the second multi-scale feature iterative fusion module to perform spatial-dimension iterative fusion on f'_the,K-1, f_m,K-2, f_m,K-3 and f_m,K-4 to obtain the decoding feature d_the comprises:

determining, by the spatial pyramid pooler in the second multi-scale feature iterative fusion module, the global feature corresponding to f'_the,K-1;
7. The RGB-T image semantic segmentation method of claim 6, wherein the decoding feature d_rgb is computed iteratively over z ∈ [1,3], wherein [·,·] denotes the cascade operation, Up(·) is the up-sampling operation, MeanPool(·) is the mean pooling operation, MaxPool(·) is the maximum pooling operation, Sigmoid(·) is the Sigmoid function, · is the dot multiplication operation, the adaptive masks are obtained by performing spatial attention mechanism evaluation on the up-sampled decoding feature and on the channel-reduced fusion feature respectively, f̄_m,K-z-1 is the feature obtained by performing channel dimension reduction on f_m,K-z-1 through one Conv unit, and the Conv unit comprises a convolution operation, a batch normalization operation, and a ReLU activation.
8. The RGB-T image semantic segmentation method according to claim 1, wherein the loss function L of the RGB-T image semantic segmentation model is the sum of a supervised term and an unsupervised term;

wherein the supervised term is the total training loss for RGB-T image pairs with pixel-level semantic segmentation labels, the unsupervised term is the total training loss for RGB-T image pairs without pixel-level semantic segmentation labels, G is the pixel-level semantic segmentation label of a labeled RGB-T image pair, h_rgb(·) is the RGB branch of the RGB-T image semantic segmentation model, h_the(·) is the thermal infrared branch of the RGB-T image semantic segmentation model, CE(·) denotes the cross entropy loss function, Y_rgb and Y_the are the pseudo labels corresponding to the two branches for an unlabeled RGB-T image pair, M is a mask associated with the unlabeled RGB-T image pair, and y is the semantic segmentation prediction corresponding to a single pixel point.
9. The RGB-T image semantic segmentation method according to any one of claims 1 to 8, wherein selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the lack of texture information of the target RGB-T image pair comprises:
Under the condition that no texture information is missing in an RGB image and a thermal infrared image in the target RGB-T image pair, taking the first semantic segmentation image as a semantic segmentation image of the target RGB-T image pair;
and under the condition that texture information is missing in either the RGB image or the thermal infrared image of the target RGB-T image pair, taking the first semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a daytime scene, and taking the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair in a night scene.
10. An RGB-T image semantic segmentation apparatus, the apparatus comprising:
the calling module is used for calling the RGB-T image semantic segmentation model; the RGB-T image semantic segmentation model is obtained by training a double-branch RGB-T semantic segmentation network comprising RGB branches and thermal infrared branches by utilizing an RGB-T image semantic segmentation data set;
the generation module is used for inputting a target RGB-T image pair into the RGB-T image semantic segmentation model to obtain a first semantic segmentation image output by the RGB branch and a second semantic segmentation image output by the thermal infrared branch;
the selecting module is used for selecting one of the first semantic segmentation image and the second semantic segmentation image as the semantic segmentation image of the target RGB-T image pair according to the condition of lack of texture information of the target RGB-T image pair;
The RGB-T image semantic segmentation data set is obtained by adopting an RGB image random mask mode to carry out data enhancement on the first data set; the first data set is obtained by performing pixel-level semantic segmentation labeling on part of RGB-T image pairs in a data set formed by RGB-T image pairs;
the dual-branch RGB-T semantic segmentation network deeply mines the cross-modal space complementary texture features of an input RGB-T image pair through space cross-modal information fusion and multi-scale feature iterative fusion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211715697.1A CN116091765A (en) | 2022-12-29 | 2022-12-29 | RGB-T image semantic segmentation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116091765A true CN116091765A (en) | 2023-05-09 |
Family
ID=86209687
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211715697.1A Pending CN116091765A (en) | 2022-12-29 | 2022-12-29 | RGB-T image semantic segmentation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116091765A (en) |
2022-12-29: CN application CN202211715697.1A filed, published as CN116091765A; status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112465828B (en) | Image semantic segmentation method and device, electronic equipment and storage medium | |
CN112634296B (en) | RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation | |
CN111932431B (en) | Visible watermark removing method based on watermark decomposition model and electronic equipment | |
CN110163188B (en) | Video processing and method, device and equipment for embedding target object in video | |
Fang et al. | Traffic accident detection via self-supervised consistency learning in driving scenarios | |
Hamdi et al. | A new image enhancement and super resolution technique for license plate recognition | |
JP7499402B2 (en) | End-to-End Watermarking System | |
CN115272437A (en) | Image depth estimation method and device based on global and local features | |
CN116091765A (en) | RGB-T image semantic segmentation method and device | |
Sheng et al. | A joint framework for underwater sequence images stitching based on deep neural network convolutional neural network | |
CN115631205A (en) | Method, device and equipment for image segmentation and model training | |
Wang et al. | STCD: efficient Siamese transformers-based change detection method for remote sensing images | |
CN111914850B (en) | Picture feature extraction method, device, server and medium | |
CN117078574A (en) | Image rain removing method and device | |
CN116797975A (en) | Video segmentation method, device, computer equipment and storage medium | |
CN112950501B (en) | Noise field-based image noise reduction method, device, equipment and storage medium | |
CN116263943A (en) | Image restoration method and equipment and electronic device | |
CN111325068B (en) | Video description method and device based on convolutional neural network | |
Chen et al. | Exploring efficient and effective generative adversarial network for thermal infrared image colorization | |
Guo et al. | A Markov random field model for the restoration of foggy images | |
Kumar et al. | Encoder–decoder-based CNN model for detection of object removal by image inpainting | |
Lin et al. | Spatio-temporal co-attention fusion network for video splicing localization | |
CN116821699B (en) | Perception model training method and device, electronic equipment and storage medium | |
Yuan et al. | Traffic scene depth analysis based on depthwise separable convolutional neural network | |
CN116363369A (en) | Image segmentation model optimization method and device, medium and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||