CN111062395B - Real-time video semantic segmentation method - Google Patents
- Publication number
- CN111062395B (application CN201911185021.4A)
- Authority
- CN
- China
- Prior art keywords
- semantic segmentation
- key frame
- training
- layer
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the field of computer vision and relates to a real-time video semantic segmentation method. The method comprises the following steps: step 1: selecting a training and testing data set; step 2: constructing an image-based backbone network; step 3: pre-training the backbone network with the training data set; step 4: constructing the overall video semantic segmentation model; step 5: training the overall video semantic segmentation model on the training data set; step 6: inputting test-set video frames, performing forward propagation in the trained video semantic segmentation model, and outputting the predicted semantic segmentation result end to end. The method has a high inference speed that meets real-time requirements, achieves high accuracy in video semantic segmentation, and is highly practical.
Description
Technical Field
The invention belongs to the field of computer vision, and relates to a real-time video semantic segmentation method.
Background
Semantic segmentation is a basic task in the field of computer vision, which aims at predicting a semantic label for each pixel in a given image. Inspired by deep learning, the task has taken a brand-new development direction; in particular, the proposal of the fully convolutional network brought image semantic segmentation to a new milestone. Video semantic segmentation tends to be more complex because video has one more temporal dimension than image data and contains a large amount of redundant information.
Directly segmenting each frame of a video with an image-based semantic segmentation method is time-consuming and cannot fully exploit the correlation between frames, so satisfactory performance cannot be obtained. Existing video semantic segmentation methods can be roughly classified by how temporal information is utilized: methods that encode motion and structural features with 3D convolution, methods that aggregate frame-by-frame information with a recurrent neural network, methods that model spatial and temporal context with a CRF, and methods that compute optical flow with an independent network and propagate features. However, the 3D-convolution-based methods can be regarded as a form of information aggregation that takes a whole video segment as input, so their processing efficiency is low, and recurrent-neural-network-based methods have a similar disadvantage. CRF-based methods require high computational cost due to the complex inference of the CRF. Optical-flow-based methods struggle to achieve accurate optical flow estimation, are time-consuming, and are prone to misalignment. Most existing methods process video frames slowly and cannot run in real time, which is necessary in many practical applications of video semantic segmentation, such as autonomous driving and intelligent monitoring.
In summary, the current video semantic segmentation method needs to fully utilize inter-frame consistency, reduce information redundancy between adjacent video frames, and further save inference time.
Disclosure of Invention
The invention aims to solve the problem of low inference speed of video semantic segmentation in the prior art, and provides a real-time video semantic segmentation method.
The working principle and process of the invention are as follows. To solve the existing problems, a lightweight, efficient and real-time image-based backbone network is first proposed as the foundation of the overall video semantic segmentation method. The backbone network adopts an encoder-decoder architecture, and a residual double-branch depth-separable convolution module (RDDS module) is proposed in the encoder to effectively capture detail information while reducing the amount of computation. To enable feature propagation, a key frame selection mechanism is employed and a dedicated global attention module is proposed to indicate the spatial correlation between non-key frames and their preceding key frames. More specifically, a real-time fully convolutional network is built with the proposed attention-based feature propagation architecture. First, input frames are divided into key frames and non-key frames according to a fixed key frame selection mechanism. For a key frame, the whole backbone network is used to extract rich spatial information at multiple levels for feature propagation. A non-key frame does not need to spend a large amount of time extracting redundant features through the whole backbone network; it only passes through the lower layers of the backbone network to extract low-level features that preserve spatial details, which are then fused with the high-level features of the previous key frame, propagated and weighted by attention.
To achieve this propagation efficiently, the invention proposes an attention-based method: the low-level feature maps of the non-key frame and the corresponding key frame are taken as input, and the spatial similarity between any two positions of the feature maps is computed to obtain an overall attention map A, where the value at each position of A represents the correlation between the corresponding positions of the key frame and the non-key frame. Since the overall attention map integrates the per-pixel correlation between the two frames, it can be considered a spatial transformation guide that captures inter-frame consistency information. The high-level features of the non-key frame are predicted by applying the attention weights to the high-level features of the corresponding key frame, and are then fused with the low-level features of the non-key frame to supplement new information that was not present in the previous key frame, thereby enhancing the ability to process complex and changing scenes. The proposed model is trained end-to-end.
The purpose of the invention is realized by the following technical scheme.
A real-time video semantic segmentation method comprises the following steps:
step 1, selecting a training and testing data set;
step 2, constructing a backbone network based on images;
step 3, pre-training the backbone network by using a training data set;
step 4, constructing an integral video semantic segmentation model;
step 5, training the whole video semantic segmentation model on a training data set;
step 6, inputting the video frames of the test set, performing forward propagation in the trained video semantic segmentation model, and outputting the predicted semantic segmentation result end to end.
The image-based backbone network described in step 2 employs an encoder-decoder architecture. The encoder comprises a residual double-branch depth-separable convolution module (RDDS module) and a downsampling module. The RDDS module comprises two symmetric branches, each containing 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 linear rectification function (ReLU) activation layer and 1 dropout layer; the results of the two branches are concatenated (Concat) and then passed through a 1 × 1 convolution layer and a ReLU activation layer. The downsampling module is composed of a max pooling layer and a 3 × 3 convolution layer whose outputs are concatenated (Concat). The decoder contains a 1 × 1 convolution layer and an 8-fold bilinear upsampling layer.
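The RDDS module described above can be sketched in PyTorch as follows. The kernel size (3), the dilation rate, the dropout probability, and the exact placement of the three batch-normalization layers within each branch are assumptions; the text names the layer types but does not fix these details.

```python
import torch
import torch.nn as nn

class RDDSModule(nn.Module):
    """Sketch of the residual double-branch depth-separable convolution
    (RDDS) module: two symmetric branches, concatenated, fused by a 1x1
    convolution, with a residual connection."""
    def __init__(self, channels, dilation=2, p_drop=0.1):
        super().__init__()
        def branch():
            return nn.Sequential(
                # depth-separable convolution: depthwise 3x3 + pointwise 1x1
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
                nn.BatchNorm2d(channels),
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),          # the branch's single ReLU
                # depth-separable dilated convolution
                nn.Conv2d(channels, channels, 3, padding=dilation,
                          dilation=dilation, groups=channels),
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels),
                nn.Dropout2d(p_drop),           # dropout layer
            )
        self.branch1, self.branch2 = branch(), branch()
        # Concat of the two branches -> 1x1 convolution -> ReLU
        self.fuse = nn.Sequential(nn.Conv2d(2 * channels, channels, 1),
                                  nn.ReLU(inplace=True))

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x)], dim=1)  # Concat
        return self.fuse(y) + x                                   # residual
```

The depthwise (`groups=channels`) plus pointwise factorization is what keeps the module's computation low while the dilated branch path enlarges the receptive field.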
The step 3 comprises the following steps:
step 3.1: pre-processing and data enhancing the images in the training dataset, resizing the images to a fixed value, using data enhancement modes of horizontal flipping, translation and color (including brightness, saturation and contrast) variation;
step 3.2: initializing a whole image semantic segmentation model;
step 3.3: and recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
The overall video semantic segmentation model in step 4 is based on a key frame selection mechanism. First, the encoder part of the backbone network is divided into a lower-layer network and a higher-layer network, with the second down-sampling layer as the boundary. If the current input video frame is a key frame, the whole backbone network is used for computation, yielding a low-level feature map F_lk and a high-level feature map F_hk; if the current input is a non-key frame, only the lower network of the encoder is used, yielding a feature map F_ln. The global attention module then performs feature propagation: taking F_ln and the F_lk of the previous key frame as input, it obtains an overall attention map A by matrix multiplication; the F_hk of the previous key frame is multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information. In both cases, the final semantic segmentation result is obtained through the decoder.
The step 5 comprises the following steps:
step 5.1, preprocessing and augmenting the images in the training data set, resizing the images to a fixed size, and using horizontal flipping, translation and color (brightness, saturation and contrast) variation as data augmentation;
step 5.2, loading model parameters pre-trained by a backbone network, initializing the whole video semantic segmentation model, inputting a key frame-non-key frame image pair each time, wherein each continuous video segment comprises 1 key frame and n non-key frames;
and 5.3, recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
For steps 3 and 5, error back propagation is performed with the stochastic gradient descent algorithm according to loss, and a polynomial learning strategy is used to update the model parameters, yielding the trained semantic segmentation model. In the polynomial learning strategy, the learning rate lr is set as:

lr = baselr × (1 − iter / maxiter)^power

where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
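The polynomial learning strategy can be sketched as follows. Because the original equation is not reproduced on this page, the decay formula below is the standard "poly" policy and is stated here as an assumption; the baselr and power values come from the text.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Standard 'poly' learning-rate policy: the rate decays smoothly from
    base_lr at iteration 0 down to 0 at max_iter."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# baselr = 5e-4 for pre-training (step 3), 3e-3 for fine-tuning (step 5);
# max_iter below is an illustrative assumption.
lr_pretrain_start = poly_lr(5e-4, 0, 60000)
lr_finetune_start = poly_lr(3e-3, 0, 60000)
```

In practice the returned rate would be written into the optimizer's parameter groups before each SGD step.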
Advantageous effects
Compared with the prior art, the invention has the following remarkable advantages:
(1) the method has high reasoning speed and can meet the requirement of real-time performance;
(2) the method has high accuracy, can accurately realize video semantic segmentation, and has practicability.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a backbone network architecture of the present invention;
FIG. 3 is a block diagram of the residual two-branch depth separable convolution module of the present invention;
FIG. 4 is a block diagram of a downsampling module of the present invention;
FIG. 5 is a diagram of a video semantic segmentation model architecture of the present invention;
FIG. 6 is a block diagram of the overall attention module of the present invention;
FIG. 7 is a partial example of the present invention on a Cityscapes dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings.
The present invention will now be described more fully hereinafter with particular reference to a preferred embodiment.
As shown in fig. 1, the real-time video semantic segmentation method of the present invention includes the following steps:
Step 1, selecting the training and testing data sets: the Cityscapes and Cityscapes-sequence datasets.
Step 2, constructing an image-based backbone network.
As shown in fig. 2, the backbone network employs an encoder-decoder architecture. The encoder comprises a residual two-branch depth separable convolution module (RDDS module) and a down-sampling module, and the decoder comprises a convolution layer of 1 x 1 convolution kernel and an 8-fold bilinear up-sampling layer.
As shown in fig. 3, the RDDS module includes two symmetric branches, each containing 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 linear rectification function (ReLU) activation layer and 1 dropout layer; the results of the two branches are concatenated (Concat) and then passed through a 1 × 1 convolution layer and a ReLU activation layer. The RDDS module effectively captures detail information while reducing the amount of computation.
As shown in fig. 4, the downsampling module performs the downsampling operation to extract features, and consists of a max pooling layer and a 3 × 3 convolution layer whose outputs are concatenated (Concat).
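The downsampling module of fig. 4 can be sketched as follows. The split of output channels between the pooling branch and the convolution branch is an assumption (the text fixes only the two branch types and the concatenation); this two-branch design matches the ENet-style initial block.

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Sketch of the downsampling module: a stride-2 max-pooling branch and
    a stride-2 3x3 convolution branch, concatenated (Concat)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(2, stride=2)            # max pooling branch
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, 3,
                              stride=2, padding=1)       # 3x3 conv branch

    def forward(self, x):
        # halve the spatial resolution, concatenate the two branches
        return torch.cat([self.conv(x), self.pool(x)], dim=1)
```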
The specific network structure of the backbone network is shown in table 1:
TABLE 1 RDDS Module network architecture
Step 3, pre-training the backbone network with the Cityscapes and Cityscapes-sequence training data sets.
The process of step 3 is:
step 3.1, preprocessing and augmenting the images in the training data set, resizing the images to a fixed size, and using horizontal flipping, translation and color (brightness, saturation and contrast) variation as data augmentation;
step 3.2, initializing the whole image semantic segmentation model;
and 3.3, recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
Step 4, constructing the overall video semantic segmentation model.
As shown in fig. 5, the overall video semantic segmentation model is based on a key frame selection mechanism. First, the encoder part of the backbone network is divided into a lower-layer network and a higher-layer network, with the second downsampling layer as the boundary. If the current input video frame is a key frame, the whole backbone network is used for computation, yielding a low-level feature map F_lk and a high-level feature map F_hk; if the current input is a non-key frame, only the lower network of the encoder is used, yielding a feature map F_ln. The global attention module then performs feature propagation: taking F_ln and the F_lk of the previous key frame as input, it obtains an overall attention map A by matrix multiplication; the F_hk of the previous key frame is multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information. In both cases, the final semantic segmentation result is obtained through the decoder.
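The per-frame control flow of fig. 5 can be sketched as below. All module arguments are placeholders for the sub-networks described in the text, and the direct addition of the propagated high-level features and F_ln assumes their shapes already match (otherwise an aligning layer would be needed) — both are assumptions, not details fixed by the patent.

```python
import torch

def video_segment_step(frame, is_key, lower_net, higher_net, decoder,
                       attention, cache):
    """Sketch of one forward pass of the overall model: key frames run the
    full backbone and refresh the cache; non-key frames reuse the cached
    key-frame features via the attention map."""
    f_low = lower_net(frame)                  # low-level features (F_lk/F_ln)
    if is_key:
        f_high = higher_net(f_low)            # full backbone for key frames
        cache['F_lk'], cache['F_hk'] = f_low, f_high
    else:
        A = attention(cache['F_lk'], f_low)   # overall attention map
        f_high = cache['F_hk'] * A + f_low    # propagate F_hk, add details
    return decoder(f_high)                    # semantic segmentation result
```

A toy run with identity-like stand-ins for the sub-networks shows the caching behavior: the key frame populates `cache`, and the following non-key frame reuses it.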
As shown in fig. 6, in order to deeply mine the spatial correlation between the low-level feature maps of the key frame and the non-key frame and implement feature propagation, an overall attention module is designed, and the attention map calculated by the module implicitly contains inter-frame consistency information and can be regarded as guiding information of feature propagation. The calculation process in the global attention module is as follows:
(1) After the second down-sampling module of the backbone network encoder, the low-level feature maps F_lk of the key frame and F_ln of the non-key frame are obtained; the number of channels of both is reduced, and F_lk, after being transposed, is multiplied with F_ln by matrix multiplication to obtain the map A' of size N × N (N = H × W);
(2) A' is then input into two parallel branches to obtain the maximum point-to-point response: average pooling and max pooling are applied to A' over the channel dimension, and the results of the two branches are concatenated (Concat) to obtain a maximum-response attention map with 2 channels.
(3) Finally, a 5 × 5 convolution layer reduces the number of channels to 1, and a Sigmoid activation layer limits the values to [-1, 1], yielding the attention map A.
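Steps (1)-(3) of the global attention module can be sketched as follows. The reduced channel count and the reshaping of the N × N map back onto the H × W grid for pooling are assumptions, and note that a Sigmoid in fact bounds its output in (0, 1).

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Sketch of the overall attention module (fig. 6): pairwise similarity
    between key-frame and non-key-frame low-level features, pooled to a
    single-channel attention map."""
    def __init__(self, in_ch, mid_ch=32):
        super().__init__()
        self.reduce_k = nn.Conv2d(in_ch, mid_ch, 1)  # channel reduction, F_lk
        self.reduce_n = nn.Conv2d(in_ch, mid_ch, 1)  # channel reduction, F_ln
        self.conv = nn.Conv2d(2, 1, 5, padding=2)    # 5x5 conv -> 1 channel

    def forward(self, f_lk, f_ln):
        B, _, H, W = f_ln.shape
        k = self.reduce_k(f_lk).flatten(2)           # (B, C', N), N = H*W
        n = self.reduce_n(f_ln).flatten(2)           # (B, C', N)
        a = torch.bmm(k.transpose(1, 2), n)          # (B, N, N): F_lk^T @ F_ln
        a = a.view(B, H * W, H, W)                   # N per-position responses
        # average- and max-pool the responses, Concat -> 2-channel map
        pooled = torch.cat([a.mean(dim=1, keepdim=True),
                            a.max(dim=1, keepdim=True).values], dim=1)
        return torch.sigmoid(self.conv(pooled))      # attention map A
```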
Step 5, training the overall video semantic segmentation model on the Cityscapes-sequence training data set.
The process of step 5 is:
step 5.1, preprocessing and augmenting the images in the training data set, resizing the images to a fixed size, and using horizontal flipping, translation and color (brightness, saturation and contrast) variation as data augmentation;
step 5.2, loading model parameters pre-trained by a backbone network, initializing the whole video semantic segmentation model, inputting a key frame-non-key frame image pair each time, wherein each continuous video segment comprises 1 key frame and n non-key frames;
and 5.3, recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
For steps 3 and 5, error back propagation is performed with the stochastic gradient descent algorithm according to loss, and a polynomial learning strategy is used to update the model parameters, yielding the trained semantic segmentation model. In the polynomial learning strategy, the learning rate lr is set as:

lr = baselr × (1 − iter / maxiter)^power

where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
Step 6, inputting the test-set video frames, performing forward propagation in the trained video semantic segmentation model, and outputting the predicted semantic segmentation result end to end.
Table 2 compares the accuracy (mIoU) and inference speed of this video semantic segmentation method with other state-of-the-art methods. The method greatly increases inference speed while maintaining high accuracy: at an mIoU of 60.6%, the inference speed reaches 131.6 fps:
TABLE 2 Comparison of this video semantic segmentation method with other state-of-the-art methods
FIG. 7 shows a partial example of the present invention on a Cityscapes dataset.
Claims (4)
1. A real-time video semantic segmentation method is characterized by comprising the following steps:
step 1, selecting a training and testing data set;
step 2, constructing a backbone network based on images;
the backbone network adopts an encoder-decoder architecture; the encoder comprises a residual double-branch depth-separable convolution module (RDDS module) and a downsampling module, wherein the RDDS module comprises two symmetric branches, each branch comprising 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 ReLU activation layer and 1 dropout layer, and the two branch results, after Concat, pass through a 1 × 1 convolution layer and a ReLU activation layer; the downsampling module is composed of a max pooling layer and a 3 × 3 convolution layer whose outputs are concatenated (Concat); the decoder comprises a 1 × 1 convolution layer and an 8-fold bilinear upsampling layer;
step 3, pre-training the backbone network by using a training data set;
step 4, constructing an integral video semantic segmentation model;
the integral video semantic segmentation model is based on a key frame selection mechanism; firstly, dividing an encoder part of a backbone network into a lower layer network part and a higher layer network part by taking a second downsampling layer as a boundary; if the current input video frame is a key frame, the whole backbone network is used for calculation, and a low-level feature map F is obtainedlkAnd a high level feature map Fhk(ii) a If the current input is a non-key frame, only the non-key frame is calculated by using the lower network of the encoder to obtain a feature map FlnThen using the global attention module to effect feature propagation, FlnAnd F of the previous key framelkAs an input, the calculation process in the global attention module is as follows:
(1) after the second downsampling module of the backbone network encoder, the low-level feature maps F_lk of the key frame and F_ln of the non-key frame are obtained; the number of channels of both is reduced, and F_lk, after being transposed, is multiplied with F_ln by matrix multiplication to obtain the map A' of size N × N, N = H × W;
(2) A' is input into two parallel branches to obtain the maximum point-to-point response: average pooling and max pooling are applied to A' over the channel dimension, and the results of the two branches are concatenated (Concat) to obtain a maximum-response attention map with 2 channels;
(3) finally, a 5 × 5 convolution layer reduces the number of channels to 1, and a Sigmoid activation layer limits the values to [-1, 1], yielding the attention map A;
the F_hk of the previous key frame is multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information; in both cases, the final semantic segmentation result is obtained through the decoder;
step 5, training the whole video semantic segmentation model on a training data set;
and 6, inputting the video frame of the test set, carrying out forward propagation in the trained video semantic segmentation model, and outputting a predicted semantic segmentation result end to end.
2. The method according to claim 1, wherein step 3 comprises:
step 3.1: preprocessing and data enhancing images in a training data set, redefining the images into fixed values, and using a data enhancing mode of horizontal turning, translation and color change;
step 3.2: initializing a whole image semantic segmentation model;
step 3.3: and recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
3. The method according to claim 1, wherein the step 5 comprises:
step 5.1: preprocessing and data enhancing images in a training data set, redefining the images into fixed values, and using a data enhancing mode of horizontal turning, translation and color change;
step 5.2: loading pre-trained model parameters of a backbone network, initializing a whole video semantic segmentation model, inputting a key frame-non-key frame image pair each time, wherein each continuous video segment comprises 1 key frame and n non-key frames;
step 5.3: and recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
4. The real-time video semantic segmentation method according to claim 2 or 3, characterized in that for step 3 and step 5, a stochastic gradient descent algorithm is used for error back propagation according to loss, and a polynomial learning strategy is used to update model parameters to obtain a trained semantic segmentation model; in the polynomial learning strategy, the learning rate lr is set as:

lr = baselr × (1 − iter / maxiter)^power

where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5; power is set to 0.9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911185021.4A CN111062395B (en) | 2019-11-27 | 2019-11-27 | Real-time video semantic segmentation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111062395A CN111062395A (en) | 2020-04-24 |
CN111062395B true CN111062395B (en) | 2020-12-18 |
Family
ID=70299046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911185021.4A Active CN111062395B (en) | 2019-11-27 | 2019-11-27 | Real-time video semantic segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111062395B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112651421B (en) * | 2020-09-04 | 2024-05-28 | 江苏濠汉信息技术有限公司 | Infrared thermal imaging power transmission line anti-external-damage monitoring system and modeling method thereof |
CN112364822B (en) * | 2020-11-30 | 2022-08-19 | 重庆电子工程职业学院 | Automatic driving video semantic segmentation system and method |
CN112862839B (en) * | 2021-02-24 | 2022-12-23 | 清华大学 | Method and system for enhancing robustness of semantic segmentation of map elements |
CN113177478B (en) * | 2021-04-29 | 2022-08-05 | 西华大学 | Short video semantic annotation method based on transfer learning |
CN113505680B (en) * | 2021-07-02 | 2022-07-15 | 兰州理工大学 | Content-based bad content detection method for high-duration complex scene video |
CN113658189B (en) * | 2021-09-01 | 2022-03-11 | 北京航空航天大学 | Cross-scale feature fusion real-time semantic segmentation method and system |
CN116246075B (en) * | 2023-05-12 | 2023-07-21 | 武汉纺织大学 | Video semantic segmentation method combining dynamic information and static information |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229336A (en) * | 2017-12-13 | 2018-06-29 | 北京市商汤科技开发有限公司 | Video identification and training method and device, electronic equipment, program and medium |
CN109241972A (en) * | 2018-08-20 | 2019-01-18 | 电子科技大学 | Image, semantic dividing method based on deep learning |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9436876B1 (en) * | 2014-12-19 | 2016-09-06 | Amazon Technologies, Inc. | Video segmentation techniques |
CN108235116B (en) * | 2017-12-27 | 2020-06-16 | 北京市商汤科技开发有限公司 | Feature propagation method and apparatus, electronic device, and medium |
CN109919044A (en) * | 2019-02-18 | 2019-06-21 | 清华大学 | The video semanteme dividing method and device of feature propagation are carried out based on prediction |
CN110147763B (en) * | 2019-05-20 | 2023-02-24 | 哈尔滨工业大学 | Video semantic segmentation method based on convolutional neural network |
- 2019-11-27: application CN201911185021.4A filed; patent CN111062395B granted, status Active
Non-Patent Citations (1)
Title |
---|
Survey of Visual Attention Detection (视觉注意力检测综述); Wang Wenguan et al.; Journal of Software; 2019-02-15; Vol. 30, No. 2; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111062395A (en) | 2020-04-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111062395B (en) | Real-time video semantic segmentation method | |
CN112634276B (en) | Lightweight semantic segmentation method based on multi-scale visual feature extraction | |
CN110059772B (en) | Remote sensing image semantic segmentation method based on multi-scale decoding network | |
CN110276354B (en) | High-resolution streetscape picture semantic segmentation training and real-time segmentation method | |
CN110569851B (en) | Real-time semantic segmentation method for gated multi-layer fusion | |
CN111696110B (en) | Scene segmentation method and system | |
CN111832453B (en) | Unmanned scene real-time semantic segmentation method based on two-way deep neural network | |
CN111652081B (en) | Video semantic segmentation method based on optical flow feature fusion | |
CN111563507A (en) | Indoor scene semantic segmentation method based on convolutional neural network | |
CN113066089B (en) | Real-time image semantic segmentation method based on attention guide mechanism | |
CN111476133B (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN114565770A (en) | Image segmentation method and system based on edge auxiliary calculation and mask attention | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN113870160A (en) | Point cloud data processing method based on converter neural network | |
CN115393289A (en) | Tumor image semi-supervised segmentation method based on integrated cross pseudo label | |
CN115496919A (en) | Hybrid convolution-transformer framework based on window mask strategy and self-supervision method | |
CN116486080A (en) | Lightweight image semantic segmentation method based on deep learning | |
CN115830575A (en) | Transformer and cross-dimension attention-based traffic sign detection method | |
CN115995002B (en) | Network construction method and urban scene real-time semantic segmentation method | |
CN110942463B (en) | Video target segmentation method based on generation countermeasure network | |
US20240062347A1 (en) | Multi-scale fusion defogging method based on stacked hourglass network | |
CN115909465A (en) | Face positioning detection method, image processing device and readable storage medium | |
CN116310324A (en) | Pyramid cross-layer fusion decoder based on semantic segmentation | |
CN116452472A (en) | Low-illumination image enhancement method based on semantic knowledge guidance | |
CN113313721B (en) | Real-time semantic segmentation method based on multi-scale structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||