CN111062395B - Real-time video semantic segmentation method - Google Patents

Real-time video semantic segmentation method

Info

Publication number
CN111062395B
CN111062395B
Authority
CN
China
Prior art keywords
semantic segmentation
key frame
training
layer
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911185021.4A
Other languages
Chinese (zh)
Other versions
CN111062395A (en)
Inventor
赵三元
吴俊蓉
文宗正
黄科乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201911185021.4A priority Critical patent/CN111062395B/en
Publication of CN111062395A publication Critical patent/CN111062395A/en
Application granted granted Critical
Publication of CN111062395B publication Critical patent/CN111062395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and relates to a real-time video semantic segmentation method. The method comprises the following steps: step 1: selecting training and test data sets; step 2: constructing an image-based backbone network; step 3: pre-training the backbone network with the training data set; step 4: constructing the overall video semantic segmentation model; step 5: training the overall video semantic segmentation model on the training data set; step 6: inputting video frames from the test set, performing forward propagation in the trained video semantic segmentation model, and outputting the predicted semantic segmentation results end to end. The method has a high inference speed and can meet real-time requirements; it has high accuracy, can segment video semantically and accurately, and is highly practical.

Description

Real-time video semantic segmentation method
Technical Field
The invention belongs to the field of computer vision, and relates to a real-time video semantic segmentation method.
Background
Semantic segmentation is a fundamental task in computer vision that aims to predict a semantic label for each pixel of a given image. Driven by deep learning, the task has taken a brand-new direction; in particular, the introduction of fully convolutional networks brought image semantic segmentation to a new milestone. Video semantic segmentation tends to be more complex, because video has one more temporal dimension than image data and contains a large amount of redundant information.
Directly segmenting every frame of a video with an image-based semantic segmentation method is time consuming and cannot fully exploit the correlation between frames, so satisfactory performance cannot be obtained. Existing video semantic segmentation methods can be roughly classified by how they use temporal information, mainly into methods that encode motion and structural features with 3D convolutions, methods that aggregate frame-by-frame information with recurrent neural networks, methods that model spatial and temporal context with CRFs, and methods that compute optical flow with a separate network and propagate features. However, 3D-convolution-based methods can be regarded as a form of information aggregation that takes a whole video clip as input, so their processing efficiency is low, and recurrent-neural-network-based methods have similar drawbacks. CRF-based methods require high computational cost because of the complex inference of CRFs. Optical-flow-based methods struggle to obtain accurate optical flow estimates, are time consuming, and often suffer from misalignment. Most existing methods process video frames slowly and cannot run in real time, which is necessary in many practical applications of video semantic segmentation, such as autonomous driving and intelligent surveillance.
In summary, current video semantic segmentation methods need to fully exploit inter-frame consistency and reduce the information redundancy between adjacent video frames in order to save inference time.
Disclosure of Invention
The invention aims to solve the problem of low inference speed in existing video semantic segmentation, and provides a real-time video semantic segmentation method.
The working principle and process of the invention are as follows. To address the existing problems, a lightweight, efficient and real-time image-based backbone network is first proposed as the foundation of the overall video semantic segmentation method. The backbone network adopts an encoder-decoder architecture, and a residual double-branch depth-separable convolution module (RDDS module) is proposed in the encoder to capture detail information effectively while reducing the amount of computation. To enable feature propagation, a key frame selection mechanism is employed and a dedicated global attention module is proposed to indicate the spatial correlation between non-key frames and their preceding key frames. More specifically, the proposed attention-based feature propagation architecture is used to build a real-time fully convolutional network. First, input frames are divided into key frames and non-key frames according to a fixed key frame selection mechanism. For a key frame, the whole backbone network is used to extract rich spatial information at multiple levels for feature propagation. A non-key frame does not need to spend a large amount of time extracting redundant features through the whole backbone network; it only extracts low-level features through the lower part of the backbone network, preserving spatial details, and these are then fused with the high-level features of the previous key frame that have been propagated and weighted by attention. To achieve this propagation efficiently, the invention proposes an attention-based method: the low-level feature maps of the non-key frame and the corresponding key frame are taken as input, and the spatial similarity between any two positions of the feature maps is computed to obtain an overall attention map A, in which the value at each position represents the correlation between the corresponding positions of the key frame and the non-key frame. Because the overall attention map integrates the per-pixel correlation between the two frames, it can be regarded as a spatial transformation guide that captures inter-frame consistency information. The predicted high-level features of the non-key frame are obtained by applying the attention weights to the high-level features of the corresponding key frame, and are then fused with the low-level features of the non-key frame to supplement new information that was not present in the previous key frame, thereby enhancing the ability to handle complex and changing scenes. The proposed model can be trained end to end.
The purpose of the invention is realized by the following technical scheme.
A real-time video semantic segmentation method comprises the following steps:
step 1, selecting a training and testing data set;
step 2, constructing a backbone network based on images;
step 3, pre-training the backbone network by using a training data set;
step 4, constructing an integral video semantic segmentation model;
step 5, training the whole video semantic segmentation model on a training data set;
and 6, inputting the video frame of the test set, carrying out forward propagation in the trained video semantic segmentation model, and outputting a predicted semantic segmentation result end to end.
The image-based backbone network described in step 2 adopts an encoder-decoder architecture. The encoder comprises a residual double-branch depth-separable convolution module (RDDS module) and a downsampling module. The RDDS module comprises two symmetric branches, each containing 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 linear rectification (ReLU) activation layer and 1 dropout layer; the outputs of the two branches are concatenated (Concat) and then passed through a convolution layer with 1 × 1 kernels and a ReLU activation layer. The downsampling module concatenates (Concat) the outputs of a max-pooling layer and a convolution layer with 3 × 3 kernels. The decoder contains a convolution layer with 1 × 1 kernels and an 8× bilinear upsampling layer.
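The following is a minimal PyTorch sketch of an RDDS module matching the description above. It is an illustration only: the channel count, dilation rate, dropout probability, the exact placement of the three batch-normalization layers within a branch, and the residual connection around the whole module are assumptions not specified verbatim in the text.

```python
import torch
import torch.nn as nn

class RDDSBranch(nn.Module):
    """One symmetric branch: depth-separable conv + depth-separable dilated conv."""
    def __init__(self, channels, dilation=2, p_drop=0.1):
        super().__init__()
        self.block = nn.Sequential(
            # depth-separable conv = depthwise 3x3 + pointwise 1x1
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # depth-separable dilated conv
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation,
                      groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Dropout2d(p_drop),
        )

    def forward(self, x):
        return self.block(x)

class RDDSModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch1 = RDDSBranch(channels)
        self.branch2 = RDDSBranch(channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),  # 1x1 conv after Concat
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x)], dim=1)  # Concat the two branches
        return self.fuse(y) + x  # residual connection (assumed from the module name)
```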
The step 3 comprises the following steps:
step 3.1: pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation;
step 3.2: initializing a whole image semantic segmentation model;
step 3.3: and recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
The overall video semantic segmentation model in step 4 is based on a key frame selection mechanism. First, the encoder part of the backbone network is divided into a lower part and a higher part, with the second downsampling layer as the boundary. If the current input video frame is a key frame, the whole backbone network is used for computation to obtain a low-level feature map F_lk and a high-level feature map F_hk. If the current input is a non-key frame, only the lower part of the encoder is used to compute its feature map F_ln, and the global attention module is then used to carry out feature propagation: F_ln and the F_lk of the previous key frame are taken as input and combined by matrix multiplication to obtain the overall attention map A; the F_hk of the previous key frame is then multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information. In both cases, the final semantic segmentation result is obtained through the decoder.
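To make step 4 concrete, the following is a minimal PyTorch sketch of the key-frame / non-key-frame forward pass, assuming injected sub-modules low_encoder, high_encoder, attention and decoder that correspond to the lower encoder part, the higher encoder part, the global attention module and the decoder. The module names, the fixed key-frame interval, and the assumption that the sub-modules return shape-compatible feature maps (so that F_hk * A and F_hn + F_ln are well defined) are illustrative and not taken from the patent.

```python
import torch.nn as nn

class VideoSegmenter(nn.Module):
    """Key-frame / non-key-frame forward pass with attention-based feature propagation."""
    def __init__(self, low_encoder, high_encoder, attention, decoder, key_interval=5):
        super().__init__()
        self.low_encoder = low_encoder      # lower part of the encoder (up to the 2nd downsampling)
        self.high_encoder = high_encoder    # higher part of the encoder
        self.attention = attention          # global attention module
        self.decoder = decoder
        self.key_interval = key_interval    # fixed key-frame selection (interval is an assumption)
        self.F_lk = None                    # cached key-frame low-level features
        self.F_hk = None                    # cached key-frame high-level features

    def forward(self, frame, frame_idx):
        if frame_idx % self.key_interval == 0:          # key frame: run the whole backbone
            self.F_lk = self.low_encoder(frame)
            self.F_hk = self.high_encoder(self.F_lk)
            features = self.F_hk
        else:                                           # non-key frame: lower encoder only
            F_ln = self.low_encoder(frame)
            A = self.attention(F_ln, self.F_lk)         # overall attention map
            F_hn = self.F_hk * A                        # propagate key-frame high-level features
            features = F_hn + F_ln                      # fuse with low-level details
        return self.decoder(features)
```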
The step 5 comprises the following steps:
step 5.1, pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation;
step 5.2, load the pre-trained backbone network parameters, initialize the overall video semantic segmentation model, and input a key frame/non-key frame image pair each time, where each continuous video segment comprises 1 key frame and n non-key frames;
and 5.3, recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
For step 3 and step 5, error back-propagation is performed according to the loss using the stochastic gradient descent algorithm, and the model parameters are updated with a polynomial learning strategy to obtain the trained semantic segmentation model. In the polynomial learning strategy, the learning rate lr is set as:
lr = baselr × (1 − iter / max_iter)^power
where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
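A short sketch of this polynomial ("poly") schedule, assuming the standard form lr = baselr · (1 − iter/max_iter)^power; the total iteration count used in the example is an assumption, since only baselr and power are specified.

```python
def poly_lr(baselr, cur_iter, max_iter, power=0.9):
    """Polynomial learning-rate decay: lr = baselr * (1 - cur_iter/max_iter)**power."""
    return baselr * (1.0 - cur_iter / max_iter) ** power

# example: backbone pre-training (step 3) with an assumed total of 300 iterations
for it in (0, 100, 200, 299):
    print(it, round(poly_lr(5e-4, it, 300), 6))
```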
Advantageous effects
Compared with the prior art, the invention has the following remarkable advantages:
(1) the method has high reasoning speed and can meet the requirement of real-time performance;
(2) the method has high accuracy, can accurately realize video semantic segmentation, and has practicability.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a backbone network architecture of the present invention;
FIG. 3 is a block diagram of the residual two-branch depth separable convolution module of the present invention;
FIG. 4 is a block diagram of a downsampling module of the present invention;
FIG. 5 is a diagram of a video semantic segmentation model architecture of the present invention;
FIG. 6 is a block diagram of the overall attention module of the present invention;
FIG. 7 is a partial example of the present invention on a Cityscapes dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings.
The present invention will now be described more fully hereinafter with particular reference to a preferred embodiment.
As shown in fig. 1, the real-time video semantic segmentation method of the present invention includes the following steps:
step 1, selecting training and test data sets; in this embodiment, the 20 classes of the Cityscapes dataset (of which class 1 is the background) are used as the benchmark, the Cityscapes (single-frame image) and Cityscapes-sequence (continuous video frame) datasets are used when training the backbone network, only the Cityscapes-sequence dataset is used when training the overall video semantic segmentation model, and the Cityscapes test dataset is used for testing.
And 2, constructing a backbone network based on the image.
As shown in fig. 2, the backbone network employs an encoder-decoder architecture. The encoder comprises a residual two-branch depth separable convolution module (RDDS module) and a down-sampling module, and the decoder comprises a convolution layer of 1 x 1 convolution kernel and an 8-fold bilinear up-sampling layer.
As shown in fig. 3, the RDDS module includes two symmetric branches; each branch includes 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 linear rectification (ReLU) activation layer and 1 dropout layer, and the two branch outputs are concatenated (Concat) and then passed through a convolution layer with 1 × 1 kernels and a ReLU activation layer. The RDDS module effectively captures detail information while reducing the amount of computation.
As shown in fig. 4, the downsampling module performs the downsampling operation used to extract features; it concatenates (Concat) the outputs of a max-pooling layer and a convolution layer with 3 × 3 kernels.
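A minimal PyTorch sketch of such a downsampling module follows. Giving the convolution out_ch − in_ch output channels so that the Concat yields out_ch channels (as in ENet-style initial blocks) is an assumption, as are the stride of 2 and an even input size.

```python
import torch
import torch.nn as nn

class DownsamplingModule(nn.Module):
    """Concat of a max-pooling branch and a strided 3x3 convolution branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # the convolution supplies the extra channels so that the Concat yields out_ch
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return torch.cat([self.pool(x), self.conv(x)], dim=1)
```

Combining a pooled branch with a learned strided-convolution branch lets both pooled and learned features survive the resolution drop at low cost.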
The specific network structure of the backbone network is shown in table 1:
TABLE 1 RDDS Module network architecture
And 3, pre-training the backbone network by using a Cityscapes and Cityscapes sequence training data set.
The process of step 3 is:
step 3.1, pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation (an illustrative augmentation pipeline is sketched after step 3.3);
step 3.2, initializing the whole image semantic segmentation model;
and 3.3, recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
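An illustrative augmentation pipeline for step 3.1, sketched with torchvision; the target size 512 × 1024 and the jitter strengths are assumptions, and for segmentation the same geometric transforms would also have to be applied to the label maps, which is omitted here.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((512, 1024)),                            # resize to a fixed size (assumed)
    transforms.RandomHorizontalFlip(p=0.5),                    # horizontal flipping
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    transforms.ColorJitter(brightness=0.3, saturation=0.3, contrast=0.3),  # color change
    transforms.ToTensor(),
])
```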
And 4, constructing an integral video semantic segmentation model.
As shown in fig. 5, the overall video semantic segmentation model is based on a key frame selection mechanism. First, the encoder part of the backbone network is divided into a lower part and a higher part, with the second downsampling layer as the boundary. If the current input video frame is a key frame, the whole backbone network is used for computation to obtain a low-level feature map F_lk and a high-level feature map F_hk. If the current input is a non-key frame, only the lower part of the encoder is used to compute its feature map F_ln, and the global attention module is then used to carry out feature propagation: F_ln and the F_lk of the previous key frame are taken as input and combined by matrix multiplication to obtain the overall attention map A; the F_hk of the previous key frame is then multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information. In both cases, the final semantic segmentation result is obtained through the decoder.
As shown in fig. 6, in order to deeply mine the spatial correlation between the low-level feature maps of the key frame and the non-key frame and implement feature propagation, an overall attention module is designed, and the attention map calculated by the module implicitly contains inter-frame consistency information and can be regarded as guiding information of feature propagation. The calculation process in the global attention module is as follows:
(1) After the computation of the second downsampling module of the backbone encoder, the low-level feature maps of the key frame and the non-key frame, F_lk, F_ln ∈ R^(C×H×W), are obtained. The number of channels of both is reduced, and the transposed F_lk is multiplied with F_ln by matrix multiplication to obtain the map A′ ∈ R^(N×N), where N = H × W.
(2) A′ is then fed into two parallel branches to obtain the maximum point-to-point response: average pooling and max pooling are applied to A′ over the channel dimension, respectively, and the outputs of the two branches are concatenated (Concat) to obtain a maximum-response attention map with 2 channels.
(3) Finally, a 5 × 5 convolution layer is used to reduce the number of channels to 1, and a Sigmoid activation layer limits the values to [-1, 1], giving the final attention map A.
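A minimal PyTorch sketch of such a global attention computation is given below, assuming inputs F_ln and F_lk of shape (B, C, H, W); the separate 1 × 1 channel-reduction convolutions, the reduced channel count, and the reshaping of the pooled responses back to (B, 1, H, W) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, in_ch, reduced_ch=32):
        super().__init__()
        self.reduce_k = nn.Conv2d(in_ch, reduced_ch, 1)   # reduce channels of key-frame features
        self.reduce_n = nn.Conv2d(in_ch, reduced_ch, 1)   # reduce channels of non-key-frame features
        self.conv = nn.Conv2d(2, 1, kernel_size=5, padding=2)

    def forward(self, F_ln, F_lk):
        B, _, H, W = F_ln.shape
        k = self.reduce_k(F_lk).flatten(2)                # (B, C_r, N), N = H*W
        n = self.reduce_n(F_ln).flatten(2)                # (B, C_r, N)
        A_prime = torch.bmm(k.transpose(1, 2), n)         # (B, N, N): transposed F_lk x F_ln
        avg = A_prime.mean(dim=1).view(B, 1, H, W)        # average pooling over the "channel" dim
        mx = A_prime.max(dim=1).values.view(B, 1, H, W)   # max pooling over the "channel" dim
        fused = self.conv(torch.cat([avg, mx], dim=1))    # Concat -> 5x5 conv -> 1 channel
        return torch.sigmoid(fused)                       # Sigmoid as named in the text (range (0, 1))
```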
And 5, training the whole video semantic segmentation model on a Cityscapes sequence training data set.
The process of step 5 is:
step 5.1, pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color (brightness, saturation and contrast) changes as data augmentation;
step 5.2, load the pre-trained backbone network parameters, initialize the overall video semantic segmentation model, and input a key frame/non-key frame image pair each time, where each continuous video segment comprises 1 key frame and n non-key frames (a sketch of this pairing is given after step 5.3);
and 5.3, recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
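For illustration, a minimal sketch of how (key frame, non-key frame) training pairs could be drawn from continuous video segments as described in step 5.2; the segment layout, the value of n, and the function name make_pairs are assumptions, not taken from the patent.

```python
def make_pairs(frames, n=4):
    """frames: list of consecutive video frames; returns (key, non-key) training pairs."""
    pairs = []
    for start in range(0, len(frames) - n, n + 1):   # each segment: 1 key frame + n non-key frames
        key = frames[start]
        for offset in range(1, n + 1):
            pairs.append((key, frames[start + offset]))
    return pairs

# usage: pairs = make_pairs(video_frames, n=4); the loss of step 5.3 is computed on the
# predicted non-key-frame segmentation against its ground-truth label.
```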
For step 3 and step 5, error back-propagation is performed according to the loss using the stochastic gradient descent algorithm, and the model parameters are updated with a polynomial learning strategy to obtain the trained semantic segmentation model. In the polynomial learning strategy, the learning rate lr is set as:
lr = baselr × (1 − iter / max_iter)^power
where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
And 6, inputting the video frame of the test set, carrying out forward propagation in the trained video semantic segmentation model, and outputting a predicted semantic segmentation result end to end.
Table 2 compares the accuracy (mIoU) and inference speed of this video semantic segmentation method with other state-of-the-art methods; it shows that the method greatly increases inference speed while maintaining high accuracy, reaching 131.6 fps at an mIoU of 60.6%:
TABLE 2 comparison of this video semantic segmentation method with other most advanced methods
FIG. 7 shows a partial example of the present invention on a Cityscapes dataset.

Claims (4)

1. A real-time video semantic segmentation method is characterized by comprising the following steps:
step 1, selecting a training and testing data set;
step 2, constructing a backbone network based on images;
the backbone network adopts an encoder-decoder architecture; the encoder comprises a residual double-branch depth-separable convolution module (RDDS module) and a downsampling module, wherein the RDDS module comprises two symmetric branches, each branch comprising 1 depth-separable convolution layer, 1 depth-separable dilated convolution layer, 3 batch normalization layers, 1 ReLU activation layer and 1 dropout layer, and the two branch outputs, after Concat, pass through a convolution layer with 1 × 1 kernels and a ReLU activation layer; the downsampling module is composed of a max-pooling layer and a convolution layer with 3 × 3 kernels whose outputs are combined by Concat; the decoder comprises a convolution layer with 1 × 1 kernels and an 8× bilinear upsampling layer;
step 3, pre-training the backbone network by using a training data set;
step 4, constructing an integral video semantic segmentation model;
the overall video semantic segmentation model is based on a key frame selection mechanism; first, the encoder part of the backbone network is divided into a lower part and a higher part, with the second downsampling layer as the boundary; if the current input video frame is a key frame, the whole backbone network is used for computation to obtain a low-level feature map F_lk and a high-level feature map F_hk; if the current input is a non-key frame, only the lower part of the encoder is used to compute its feature map F_ln, and the global attention module is then used to carry out feature propagation, with F_ln and the F_lk of the previous key frame as input; the calculation process in the global attention module is as follows:
(1) after the computation of the second downsampling module of the backbone encoder, the low-level feature maps of the key frame and the non-key frame, F_lk, F_ln ∈ R^(C×H×W), are obtained; the number of channels of both is reduced, and the transposed F_lk is multiplied with F_ln by matrix multiplication to obtain the map A′ ∈ R^(N×N), where N = H × W;
(2) A′ is fed into two parallel branches to obtain the maximum point-to-point response: average pooling and max pooling are applied to A′ over the channel dimension, respectively, and the outputs of the two branches are concatenated (Concat) to obtain a maximum-response attention map with 2 channels;
(3) finally, a 5 × 5 convolution layer is used to reduce the number of channels to 1, and a Sigmoid activation layer limits the values to [-1, 1], giving the final attention map A;
the F_hk of the previous key frame is multiplied by A to obtain the predicted high-level feature map F_hn of the current non-key frame, and F_hn and F_ln are added to supplement detail information; in both cases, the final semantic segmentation result is obtained through the decoder;
step 5, training the whole video semantic segmentation model on a training data set;
and 6, inputting the video frame of the test set, carrying out forward propagation in the trained video semantic segmentation model, and outputting a predicted semantic segmentation result end to end.
2. The method according to claim 1, wherein step 3 comprises:
step 3.1: pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color changes as data augmentation;
step 3.2: initializing a whole image semantic segmentation model;
step 3.3: and recording the cross entropy loss of the semantic segmentation result predicted by the model and the labeled image in the training process as loss.
3. The method according to claim 1, wherein the step 5 comprises:
step 5.1: pre-process and augment the images in the training data set, resizing the images to a fixed size and using horizontal flipping, translation and color changes as data augmentation;
step 5.2: loading pre-trained model parameters of a backbone network, initializing a whole video semantic segmentation model, inputting a key frame-non-key frame image pair each time, wherein each continuous video segment comprises 1 key frame and n non-key frames;
step 5.3: and recording the cross entropy loss of the semantic segmentation result of the model for predicting the non-key frame and the labeled image in the training process as loss.
4. The real-time video semantic segmentation method according to claim 2 or 3, characterized in that for step 3 and step 5, a stochastic gradient descent algorithm is used for error back propagation according to loss, and a polynomial learning strategy is used to update model parameters to obtain a trained semantic segmentation model; in the polynomial learning strategy, the learning rate lr is set as:
lr = baselr × (1 − iter / max_iter)^power
where baselr is the initial learning rate, set to 5e-4 in step 3 and 3e-3 in step 5, and power is set to 0.9.
CN201911185021.4A 2019-11-27 2019-11-27 Real-time video semantic segmentation method Active CN111062395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911185021.4A CN111062395B (en) 2019-11-27 2019-11-27 Real-time video semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911185021.4A CN111062395B (en) 2019-11-27 2019-11-27 Real-time video semantic segmentation method

Publications (2)

Publication Number Publication Date
CN111062395A CN111062395A (en) 2020-04-24
CN111062395B true CN111062395B (en) 2020-12-18

Family

ID=70299046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911185021.4A Active CN111062395B (en) 2019-11-27 2019-11-27 Real-time video semantic segmentation method

Country Status (1)

Country Link
CN (1) CN111062395B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651421B (en) * 2020-09-04 2024-05-28 江苏濠汉信息技术有限公司 Infrared thermal imaging power transmission line anti-external-damage monitoring system and modeling method thereof
CN112364822B (en) * 2020-11-30 2022-08-19 重庆电子工程职业学院 Automatic driving video semantic segmentation system and method
CN112862839B (en) * 2021-02-24 2022-12-23 清华大学 Method and system for enhancing robustness of semantic segmentation of map elements
CN113177478B (en) * 2021-04-29 2022-08-05 西华大学 Short video semantic annotation method based on transfer learning
CN113505680B (en) * 2021-07-02 2022-07-15 兰州理工大学 Content-based bad content detection method for high-duration complex scene video
CN113658189B (en) * 2021-09-01 2022-03-11 北京航空航天大学 Cross-scale feature fusion real-time semantic segmentation method and system
CN116246075B (en) * 2023-05-12 2023-07-21 武汉纺织大学 Video semantic segmentation method combining dynamic information and static information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN109241972A (en) * 2018-08-20 2019-01-18 电子科技大学 Image, semantic dividing method based on deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436876B1 (en) * 2014-12-19 2016-09-06 Amazon Technologies, Inc. Video segmentation techniques
CN108235116B (en) * 2017-12-27 2020-06-16 北京市商汤科技开发有限公司 Feature propagation method and apparatus, electronic device, and medium
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN110147763B (en) * 2019-05-20 2023-02-24 哈尔滨工业大学 Video semantic segmentation method based on convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN109241972A (en) * 2018-08-20 2019-01-18 电子科技大学 Image, semantic dividing method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Survey on visual attention detection; Wang Wenguan et al.; Journal of Software (《软件学报》); 2019-02-15; Vol. 30, No. 2; full text *

Also Published As

Publication number Publication date
CN111062395A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN111062395B (en) Real-time video semantic segmentation method
CN112634276B (en) Lightweight semantic segmentation method based on multi-scale visual feature extraction
CN110059772B (en) Remote sensing image semantic segmentation method based on multi-scale decoding network
CN110276354B (en) High-resolution streetscape picture semantic segmentation training and real-time segmentation method
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN111696110B (en) Scene segmentation method and system
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
CN111652081B (en) Video semantic segmentation method based on optical flow feature fusion
CN111563507A (en) Indoor scene semantic segmentation method based on convolutional neural network
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN114565770A (en) Image segmentation method and system based on edge auxiliary calculation and mask attention
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113870160A (en) Point cloud data processing method based on converter neural network
CN115393289A (en) Tumor image semi-supervised segmentation method based on integrated cross pseudo label
CN115496919A (en) Hybrid convolution-transformer framework based on window mask strategy and self-supervision method
CN116486080A (en) Lightweight image semantic segmentation method based on deep learning
CN115830575A (en) Transformer and cross-dimension attention-based traffic sign detection method
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN110942463B (en) Video target segmentation method based on generation countermeasure network
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
CN115909465A (en) Face positioning detection method, image processing device and readable storage medium
CN116310324A (en) Pyramid cross-layer fusion decoder based on semantic segmentation
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
CN113313721B (en) Real-time semantic segmentation method based on multi-scale structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant