CN115100409A - Video portrait segmentation algorithm based on twin network - Google Patents
- Publication number: CN115100409A (application CN202210759308.9A)
- Authority: CN (China)
- Prior art keywords: module, video frame, network, encoder, features
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/454 — Biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/54 — Extraction of image or video features relating to texture
- G06V10/62 — Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes
- G06V20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
The invention discloses a video portrait segmentation algorithm based on a twin (Siamese) network, relating to the technical field of image processing. The algorithm adopts a twin network structure whose basic components are a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module. The modules are built with the PyTorch deep learning framework, and the model learns a video processing method that predicts an accurate alpha matte for each frame of a video and extracts the portrait from a given image or video, thereby achieving high-resolution video portrait segmentation in complex scenes.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a video portrait segmentation algorithm based on a twin network.
Background
In computer vision, semantic image segmentation is an important research topic with wide application in many fields: for example, foreground segmentation of images can be used to change the background of a video, blending foreground characters into different scenes and enabling creative algorithm applications.
The purpose of portrait segmentation is to predict an accurate alpha matte that can be used to extract a person from a given image or video. It has wide applications such as photo editing and movie creation. A video portrait segmentation algorithm aims to predict the alpha matte of each video frame in complex scenes and thereby separate foreground from background. Existing real-time high-resolution video portrait segmentation algorithms obtain high-quality predictions only with the help of a green screen, and algorithms that work without a green screen have their own problems: for example, the data sets require trimaps, and obtaining trimaps is costly.
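To make the role of the alpha matte concrete, the following sketch shows how a predicted matte is used to composite an extracted foreground over a new background. The image sizes and pixel values are invented for illustration.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background using a per-pixel alpha matte.

    foreground, background: (H, W, 3) float arrays in [0, 1]
    alpha: (H, W) float array in [0, 1]; 1 = person, 0 = background
    """
    a = alpha[..., None]                      # broadcast over the RGB channels
    return a * foreground + (1.0 - a) * background

fg = np.ones((4, 4, 3)) * 0.8                 # toy "person" pixels
bg = np.zeros((4, 4, 3))                      # toy new background
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0                         # the person occupies the center
out = composite(fg, bg, alpha)
```

This is exactly why matte accuracy at portrait edges matters: any error in `alpha` leaks background pixels into the composited result.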
Therefore, a video portrait segmentation algorithm based on a twin network is urgently needed to solve the problems.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video portrait segmentation algorithm based on a twin network, which ensures high-precision segmentation of video portraits in environments with complex backgrounds, complex subject shapes and the like.
In order to achieve the above purpose, the invention designs and implements a video portrait segmentation algorithm based on a twin network, in which a high-resolution alpha matte is obtained through twin-network weight sharing, a recurrent neural network that captures temporal and spatial features, and joint upsampling. The technical scheme is as follows: a video portrait segmentation algorithm based on a twin network adopts a twin network structure whose basic components are a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and the method comprises the following steps:
step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
step S2: separating the preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame image into the Encoder network module and extracting coarse-grained features of the three-channel RGB video frame at five scales using a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module and recalibrating the features by learning the importance of each channel;
step S5: obtaining features at different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, fusing them, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features to obtain a high-resolution feature map;
step S6: obtaining features at three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
Further, the obtaining, by the video frame acquisition module, of a current video frame image from the video to be segmented and preprocessing it to obtain a preprocessed current video frame image includes: step S11: acquiring the current video frame image of the video to be segmented; step S12: preprocessing the acquired current video frame image.
Furthermore, the current video frame image is obtained from the video to be segmented through the video frame acquisition module and preprocessed, so as to obtain the preprocessed current video frame image.
Further, the three-channel RGB video frame image is input into the Encoder network module, and a MobileNetV3 network extracts its coarse-grained features at five scales: the lightweight MobileNetV3-Large network serves as the backbone, a four-stage encoder is built on the basis of the twin network, and coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through a down-sampling layer and the four-stage encoder.
Furthermore, the Encoder network module comprises a down-sampling layer and a four-stage encoder. The down-sampling layer uses bilinear interpolation to perform 4× down-sampling, yielding a feature map at 1/4 of the original image resolution. The four-stage encoder comprises a first-stage, a second-stage, a third-stage and a fourth-stage encoder; each stage adopts a weight-shared bottleneck structure that first applies a point-wise convolution group, then a depth-wise convolution group, connects to an SE (squeeze-and-excitation) module to learn channel weights, and finally passes the shallow features containing structural information to the deep features through a short link.
Further, connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module, which recalibrates the features at the channel level by learning the importance of each channel: a Squeeze operation converts the obtained coarse-grained features into a global feature by global averaging; an Excitation operation then learns the nonlinear relations among the channels from this global feature, obtains the weights of the different channels and recalibrates the features.
Further, features at different scales are obtained from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module and fused, capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features; a four-stage decoder corresponding to the Encoder module progressively restores and enlarges the high-level semantic information to obtain a high-resolution feature map.
Furthermore, the four-stage decoder performs multi-level feature fusion, reduces the number of channels and produces high-resolution feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution. The input of each decoder stage is merged with the output of the down-sampling process; after convolution and normalization, a ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
Further, obtaining features at three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image comprises: step S41: fusing the three features at different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; step S42: using separable convolution groups with different dilation rates to enlarge the receptive field, capturing context information, outputting four groups of feature maps of unchanged resolution, and concatenating them to fuse the multi-scale context information; step S43: generating an alpha matte with one channel from the fused multi-scale context information using a 3×3 2D convolution.
Furthermore, fusing the three features at different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module and outputting a feature map comprises: first performing a 3×3 2D convolution to unify the channel counts of the three inputs, then up-sampling them uniformly to the high-resolution feature scale, and finally outputting a feature map whose resolution matches the current video frame.
The above technical scheme shows that the invention has the following advantages:
compared with the prior art, the method can capture edge features, shallow texture features, time sequence, spatial features and other multi-level features, can supplement the time sequence, space and edge structural information of the alpha mask map of the current frame of the video, and realizes accurate prediction of the alpha mask, thereby segmenting the portrait from the background. The invention obtains accurate segmentation of portrait edge in various complex environments with low contrast between foreground and background, complex main body shape and the like, and has stronger robustness.
The method can perform high-precision video portrait segmentation on targets under complex scenes such as multiple targets, target shielding, tiny targets and fast target movement, and adopts a deep learning PyTorch frame to construct a learning video processing method of each model. The pre-training model of the video portrait segmentation algorithm based on the twin network is superior to other algorithms in terms of index results and visual effects on a test data set.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a video portrait segmentation algorithm based on a twin network according to the present invention.
FIG. 2 is a schematic diagram of the overall network structure of a video portrait segmentation algorithm based on a twin network according to the present invention.
FIG. 3 is a diagram illustrating a step of obtaining a video frame image according to the present invention.
FIG. 4 is a diagram illustrating a pre-processing procedure for a current video frame according to the present invention.
Fig. 5 is a schematic structural diagram of the Encoder network module bottleneck according to the present invention.
Fig. 6 is a detailed network structure diagram of an Encoder network module according to the present invention.
Fig. 7 is a schematic structural diagram of a Decoder module according to the present invention.
FIG. 8 is a schematic diagram of a JPU module according to the present invention.
FIG. 9 is a step diagram of a JPU module according to the present invention.
FIG. 10 is a diagram illustrating the effect of video image segmentation in different scenes by the video image segmentation algorithm based on the twin network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments and the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
The invention designs and implements a video portrait segmentation algorithm based on a twin network, which performs multi-scale feature fusion on the basis of the twin network, guides the network to capture edge features, shallow texture features, and temporal and spatial features, and can perform video portrait segmentation quickly and accurately.
Fig. 1 and 2 show a flow chart and a general network structure diagram of a video portrait segmentation algorithm based on a twin network.
According to the video portrait segmentation algorithm based on the twin network shown in figures 1 and 2, the algorithm aims to obtain a more accurate alpha matte by combining the semantic information, the temporal and spatial information, and the structural detail information of the video portrait. It adopts a twin network structure whose basic components are a video frame acquisition module, an RGB separation module, an Encoder network module, an SE (squeeze-and-excitation) module, a Decoder network module and a JPU (Joint Pyramid Upsampling) module, and specifically comprises the following steps:
step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
step S2: separating the preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame image into the Encoder network module and extracting coarse-grained features of the three-channel RGB video frame at five scales using a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module and recalibrating the features by learning the importance of each channel;
step S5: obtaining features at different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, fusing them, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features;
step S6: obtaining features at three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
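The six steps above can be sketched as a single PyTorch forward pass. This is a hypothetical stand-in, not the patented network: each stage is replaced by a trivial convolution purely to show the data flow and resolutions, and the ConvGRU and multi-scale branches are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PipelineSketch(nn.Module):
    """Toy stand-in for the twin-network pipeline (channel sizes invented)."""
    def __init__(self, ch=16):
        super().__init__()
        self.encoder = nn.Conv2d(3, ch, 3, stride=4, padding=1)   # S3: 1/4-res coarse features
        self.se = nn.Sequential(                                  # S4: channel re-weighting (SE)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.decoder = nn.Conv2d(ch, ch, 3, padding=1)            # S5: decode (ConvGRU omitted)
        self.head = nn.Conv2d(ch, 1, 3, padding=1)                # S6: 1-channel alpha matte

    def forward(self, frame_hr):
        feats = self.encoder(frame_hr)
        feats = feats * self.se(feats)                            # recalibrate channels
        feats = self.decoder(feats)
        feats = F.interpolate(feats, size=frame_hr.shape[-2:],    # back to frame resolution
                              mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(feats))

frame = torch.rand(1, 3, 64, 64)      # S1/S2: a preprocessed three-channel RGB frame
alpha = PipelineSketch()(frame)
```

The output has the frame's resolution and one channel, matching the alpha matte the algorithm predicts per frame.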
Fig. 3 and 4 show diagrams of steps for obtaining video frame images.
According to the acquisition of the video frame image shown in fig. 3, obtaining and preprocessing the current video frame image of the video to be segmented includes:
step S11: acquiring a current video frame image of a video to be segmented;
step S12: and preprocessing the acquired current video frame image.
According to the preprocessing of the current video frame shown in fig. 4, the preprocessing of the current video frame includes:
step S121: adjusting the size of the video to be segmented to be a preset size, wherein the preset size is the size of an input image required by the twin network;
step S122: normalizing the pixels of the image after the size adjustment;
step S123: and adjusting the order of the color channels of the normalized image according to a preset order.
Preprocessing the current video frame turns it into an image form suited to the twin network structure, which facilitates image input and accurate segmentation.
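Steps S121–S123 can be sketched as follows. The patent does not give the preset input size, normalization constants, or the channel orders involved, so the 512×512 size, the ImageNet-style mean/std, and the BGR-to-RGB reordering below are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

PRESET = (512, 512)                                           # assumed network input size
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)      # assumed normalization stats
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(frame_bgr_uint8: torch.Tensor) -> torch.Tensor:
    """frame_bgr_uint8: (H, W, 3) uint8 frame, assumed BGR channel order."""
    x = frame_bgr_uint8.float().permute(2, 0, 1) / 255.0      # HWC -> CHW, scale to [0, 1]
    x = F.interpolate(x[None], size=PRESET,                   # S121: resize to preset size
                      mode="bilinear", align_corners=False)[0]
    x = (x - MEAN) / STD                                      # S122: normalize pixels
    return x.flip(0)                                          # S123: reorder channels BGR -> RGB

frame = torch.randint(0, 256, (480, 640, 3), dtype=torch.uint8)
inp = preprocess(frame)
```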
Fig. 5 and fig. 6 show a schematic diagram of the Encoder network module bottleneck structure and a detailed network diagram.
According to the Encoder network module shown in fig. 5 and fig. 6, the Encoder network module has a twin network structure: the lightweight MobileNetV3-Large network, designed for semantic segmentation, is selected as the backbone, and a four-stage encoder is constructed on the basis of the twin network.
The three-channel RGB video frame image is input into the Encoder network module, and the MobileNetV3 network extracts its coarse-grained features at five scales; with MobileNetV3-Large as the backbone and the four-stage encoder built on the twin network, coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through the down-sampling layer and the four-stage encoder.
The down-sampling layer uses bilinear interpolation to perform 4× down-sampling, yielding a feature map at 1/4 of the original resolution of the current video frame. The four-stage encoder comprises a first-stage, a second-stage, a third-stage and a fourth-stage encoder; each stage adopts a weight-shared bottleneck structure that first applies a point-wise convolution group, then a depth-wise convolution group, connects to an SE (squeeze-and-excitation) module to learn channel weights, and finally passes the shallow features containing structural information to the deep features through a short link.
The four-stage encoder includes a first-stage encoder Encode_Blk1, a second-stage encoder Encode_Blk2, a third-stage encoder Encode_Blk3 and a fourth-stage encoder Encode_Blk4, all of which use weight-shared bottleneck structures. The bottleneck is an inverted residual structure: first a point-wise convolution group (1×1 2D convolution + batch normalization + activation layer), then a depth-wise convolution group (3×3 2D convolution + batch normalization + activation layer) connected to an SE module that learns channel weights, and finally a shortcut link that passes the shallow features containing structural detail information to the deep features.
Specifically, the first-stage encoder Encode_Blk1 contains two bottleneck blocks and yields a feature map at 1/8 of the original resolution; the second-stage encoder Encode_Blk2 contains two bottleneck blocks and yields a feature map at 1/16; the third-stage encoder Encode_Blk3 contains three bottleneck blocks and yields a feature map at 1/32; and the fourth-stage encoder Encode_Blk4 contains six bottleneck blocks and yields a feature map at 1/64, thus producing coarse-grained features of the three-channel RGB video frame at five scales.
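A hedged sketch of one such bottleneck block follows: a 1×1 point-wise expansion (conv + BN + activation), a 3×3 depth-wise convolution (conv + BN + activation), an SE re-weighting step, and a shortcut carrying the shallow features forward. The channel sizes, expansion ratio, and Hardswish activation are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Inverted-residual bottleneck with an SE step (illustrative sizes)."""
    def __init__(self, ch, expand=4):
        super().__init__()
        mid = ch * expand
        self.pointwise = nn.Sequential(                 # 1x1 2D conv + BN + activation
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.Hardswish())
        self.depthwise = nn.Sequential(                 # 3x3 depth-wise conv + BN + activation
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.Hardswish())
        self.se = nn.Sequential(                        # squeeze-and-excitation weighting
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, mid // 4, 1), nn.ReLU(),
            nn.Conv2d(mid // 4, mid, 1), nn.Sigmoid())
        self.project = nn.Conv2d(mid, ch, 1, bias=False)

    def forward(self, x):
        y = self.depthwise(self.pointwise(x))
        y = y * self.se(y)                              # re-weight channels
        return x + self.project(y)                      # shortcut: shallow -> deep features

x = torch.rand(1, 8, 32, 32)
y = Bottleneck(8)(x)
```

Stacking two, two, three and six such blocks with strides would give the 1/8 to 1/64 feature pyramid described above.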
Connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module, which recalibrates the features at the channel level by learning the importance of each channel: a Squeeze operation converts the obtained coarse-grained features into a global feature by global averaging; an Excitation operation then learns the nonlinear relations among the channels, obtains the weights of the different channels and recalibrates the features. The SE module thus yields a weight coefficient for each channel, giving the model stronger discrimination over the features of each channel.
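The Squeeze and Excitation operations described above can be written as a minimal block: Squeeze is global average pooling to one value per channel, and Excitation is a small bottlenecked MLP with a sigmoid that produces per-channel weights. The reduction ratio r=4 is an assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block (reduction ratio assumed)."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),   # learn nonlinear channel relations
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                     # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                # Squeeze: global average -> (N, C)
        w = self.fc(s).view(n, c, 1, 1)       # Excitation: per-channel weights in (0, 1)
        return x * w                          # recalibrate features channel-wise

x = torch.rand(2, 8, 16, 16)
y = SEBlock(8)(x)
```

Because the weights lie in (0, 1), the block can only attenuate channels, which is how it expresses each channel's learned importance.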
Fig. 7 shows a detailed network structure diagram of a Decoder network module.
According to the Decoder network module shown in fig. 7, the Decoder network module is a twin network in which weights are shared among the Decoder blocks, forming a pseudo-twin network with the Encoder network module. The Decoder network module obtains four features at different scales from the current video frame, the Encoder module, the downsampled current video frame (Image LR) and the ConvGRU recurrent neural network, and fuses them to recover the edge features lost in downsampling, shallow texture features, and temporal and spatial features.
To reduce the number of parameters and the computation, the four-stage decoder corresponding to the Encoder module splits its input along the channel dimension; the ConvGRU recurrent networks in each module compute on the split features, and the remainder is merged with the result through short links. The four-stage decoder performs multi-level feature fusion, reduces the number of channels and produces high-resolution feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution. The input of each decoder stage is merged with the output of the down-sampling process; after convolution and normalization, a ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
Specifically, each stage of the four-stage decoder comprises a 3×3 2D convolution + batch normalization + ReLU activation combination, a ConvGRU recurrent network, and 2× bilinear interpolation up-sampling. Like the up-sampling path of a conventional U-Net, the decoder input is merged with the output of the down-sampling process; after the 3×3 2D convolution + batch normalization + ReLU combination, the ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
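The ConvGRU cell that carries previous-frame information into the current frame can be sketched as below. The gate layout follows the standard GRU equations with convolutions in place of matrix products; the 3×3 kernel size is an assumption, and the patent's channel-splitting optimization is omitted.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell (kernel size and channel count illustrative)."""
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)      # update (z) and reset (r) gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)           # candidate hidden state

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde                          # blend old and new state

cell = ConvGRUCell(8)
h = torch.zeros(1, 8, 16, 16)                 # state carried from the "previous frame"
for _ in range(3):                            # feed three consecutive frame features
    h = cell(torch.rand(1, 8, 16, 16), h)
```

The hidden state keeps the feature-map shape, so it can be merged with the decoder's spatial features at each stage.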
Fig. 8 and 9 show a JPU module structure diagram and a step diagram.
According to the JPU module shown in fig. 8 and 9, the JPU module casts the extraction of the high-resolution feature map as joint upsampling: it uses the three features at different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module to efficiently generate a high-resolution output, given the corresponding low-resolution output (Image LR, the Decoder network module output) and the high-resolution image (Image HR) as guidance. The JPU module comprises the following steps:
step S41: performing feature fusion on three features with different scales obtained by a current video frame (Image HR), a current video frame downsampling (Image LR) and a Decoder network module, and outputting a feature graph;
step S42: using separable convolution groups with different cavity rates to increase the visual field, capturing context information, outputting four groups of feature graphs with unchanged resolution, and fusing multi-scale context information through merging (Concatenate);
step S43: an alpha Mongolian layout with the number of channels of 1 is generated by using a 3X 3 2D convolution.
In order to reduce the computational complexity and parameter count of the convolution operation, the ordinary standard convolution is replaced by separable convolution groups consisting of dilated convolutions with different dilation rates and point-wise convolutions. By decoupling channel correlation from spatial correlation, a standard convolution is replaced by a 3 × 3 depthwise convolution followed by a 1 × 1 point-wise convolution.
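The saving from this decoupling can be checked with a quick parameter count (bias terms omitted; the 64-channel example and the helper names are illustrative, not figures from the patent):

```python
def conv_params(c_in, c_out, k=3):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def separable_params(c_in, c_out, k=3):
    """Weight count of a k x k depthwise convolution plus a 1 x 1
    point-wise convolution (bias omitted)."""
    return k * k * c_in + c_in * c_out

# Example: 64 -> 64 channels with a 3 x 3 kernel.
standard = conv_params(64, 64)        # 9 * 64 * 64 = 36864
separable = separable_params(64, 64)  # 576 + 4096  = 4672
print(standard, separable, round(standard / separable, 1))  # roughly 7.9x fewer
```

In general the reduction factor is about k²·C_out / (k² + C_out), so for a 3 × 3 kernel with many output channels the separable form needs close to nine times fewer weights.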
Specifically, performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting the feature map, comprises the following steps: first, a 3 × 3 2D convolution operation unifies the channel numbers of the three input features; second, an upsampling operation uniformly restores them to the high-resolution feature scale; finally, a feature map whose resolution matches that of the current video frame is output.
FIG. 10 shows the portrait segmentation results of the video portrait segmentation algorithm based on a twin network in different scenes.
As can be seen from fig. 10, which shows the portrait segmentation results in different scenes, the method accurately segments portrait edges and separates the portrait in various complex environments such as low foreground-background contrast, complex backgrounds and complex subject shapes, and is highly robust. The method captures multi-level features including edge features, shallow texture features, and temporal and spatial features; it supplements the temporal, spatial and edge structure information of the alpha mask map of the current video frame, and accurately predicts the alpha mask map so as to segment the portrait from the background.
The method performs high-precision video portrait segmentation of targets in complex scenes such as multiple targets, target occlusion, tiny targets and fast target motion; each model is built with the PyTorch deep learning framework. On the test dataset, the pre-trained model of the video portrait segmentation algorithm based on a twin network outperforms other algorithms in both quantitative metrics and visual effect.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A video portrait segmentation algorithm based on a twin network, characterized in that a twin network structure is adopted; the basic structure of the algorithm comprises a video frame image acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module; and the algorithm comprises the following steps:
step S1: acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing the current video frame image to obtain a preprocessed current video frame image;
step S2: separating the obtained preprocessed video frame image into three-channel RGB video frame images in an RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame images into the Encoder network module, and extracting five multi-scale coarse-grained features of the three-channel RGB video frame images by adopting a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing feature recalibration at the feature-channel level by learning the importance of each channel;
step S5: obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing edge features lost in downsampling, shallow texture features, and temporal and spatial features;
step S6: obtaining, through the JPU module, three features of different scales from the current video frame, the downsampled current video frame and the Decoder network module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
2. The twin network based video portrait segmentation algorithm of claim 1, wherein acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing it to obtain the preprocessed current video frame image comprises: step S11: acquiring a current video frame image of the video to be segmented; step S12: preprocessing the acquired current video frame image.
3. The twin network based video portrait segmentation algorithm of claim 2, wherein the preprocessing of the current video frame comprises: step S121: adjusting the size of the video to be segmented to a preset size, wherein the preset size is the input image size required by the twin network; step S122: normalizing the pixels of the resized image; step S123: adjusting the order of the color channels of the normalized image to a preset order.
4. The twin network based video portrait segmentation algorithm of claim 1, wherein the three-channel RGB video frame images are input into the Encoder network module and a MobileNetV3 network is adopted to extract five multi-scale coarse-grained features of the three-channel RGB video frame images, comprising: using the lightweight network MobileNetV3-Large as the backbone, constructing a four-level encoder based on the twin network, and obtaining coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution through a downsampling layer and the four-level encoder.
5. The twin network based video portrait segmentation algorithm of claim 4, wherein the downsampling layer performs 4× downsampling by bilinear interpolation to obtain a feature map at 1/4 of the original image resolution; the four-level encoder comprises a first-stage encoder, a second-stage encoder, a third-stage encoder and a fourth-stage encoder, each stage adopting a bottleneck structure with multiple shared weights; each encoder stage first applies a point-wise convolution group, then a depthwise convolution group connected to an SE (Squeeze-and-Excitation) module that learns channel weights, and finally passes shallow features containing structured information to the deep features through skip connections.
6. The twin network based video portrait segmentation algorithm of claim 1, wherein connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing feature recalibration at the feature-channel level by learning the importance of each channel comprises: compressing the obtained coarse-grained features into a global feature through a Squeeze operation, the global feature being obtained by global average pooling; and performing an Excitation operation on the global feature obtained by the Squeeze operation, learning the nonlinear relations among the channels, obtaining weights for the different channels, and recalibrating the features.
7. The twin network based video portrait segmentation algorithm of claim 1, wherein obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing edge features lost in downsampling, shallow texture features, and temporal and spatial features, comprises gradually restoring and enlarging the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
8. The twin network based video portrait segmentation algorithm of claim 7, wherein the four-level decoder is configured for multi-level feature fusion, channel number reduction and high-resolution feature map acquisition, respectively obtaining feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution; the input of each decoder stage is merged with the corresponding output of the downsampling process, and after convolution and normalization, a ConvGRU recurrent network computes the output using information from the previous frame and the current frame.
9. The twin network based video portrait segmentation algorithm of claim 1, wherein efficiently generating the high-resolution feature map through the JPU module, given the corresponding low-resolution output and the high-resolution image, from the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, comprises the following steps: step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and fusing the multi-scale context information by concatenation; step S43: generating an alpha mask map with 1 channel from the fused multi-scale context information using a 3 × 3 2D convolution.
10. The twin network based video portrait segmentation algorithm of claim 9, wherein performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting the feature map, comprises the following steps: first, performing a 3 × 3 2D convolution operation to unify the channel numbers of the three input features; second, performing an upsampling operation to uniformly restore them to the high-resolution feature scale; and finally, outputting a feature map whose resolution matches that of the current video frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210759308.9A CN115100409B (en) | 2022-06-30 | 2022-06-30 | Video portrait segmentation algorithm based on twin network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115100409A true CN115100409A (en) | 2022-09-23 |
CN115100409B CN115100409B (en) | 2024-04-26 |
Family
ID=83295324
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210759308.9A Active CN115100409B (en) | 2022-06-30 | 2022-06-30 | Video portrait segmentation algorithm based on twin network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115100409B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116205928A (en) * | 2023-05-06 | 2023-06-02 | 南方医科大学珠江医院 | Image segmentation processing method, device and equipment for laparoscopic surgery video and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
WO2021088300A1 (en) * | 2019-11-09 | 2021-05-14 | 北京工业大学 | Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network |
CN114299944A (en) * | 2021-12-08 | 2022-04-08 | 天翼爱音乐文化科技有限公司 | Video processing method, system, device and storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021088300A1 (en) * | 2019-11-09 | 2021-05-14 | 北京工业大学 | Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network |
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
CN114299944A (en) * | 2021-12-08 | 2022-04-08 | 天翼爱音乐文化科技有限公司 | Video processing method, system, device and storage medium |
Non-Patent Citations (1)
Title |
---|
HONG QING; SONG QIAO; YANG CHENTAO; ZHANG PEI; CHANG LIANLI: "Image segmentation technology for mechanical parts based on intelligent vision", Machine Building & Automation, no. 05, 20 October 2020 (2020-10-20) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116205928A (en) * | 2023-05-06 | 2023-06-02 | 南方医科大学珠江医院 | Image segmentation processing method, device and equipment for laparoscopic surgery video and medium |
CN116205928B (en) * | 2023-05-06 | 2023-07-18 | 南方医科大学珠江医院 | Image segmentation processing method, device and equipment for laparoscopic surgery video and medium |
Also Published As
Publication number | Publication date |
---|---|
CN115100409B (en) | 2024-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110111366B (en) | End-to-end optical flow estimation method based on multistage loss | |
CN110287849B (en) | Lightweight depth network image target detection method suitable for raspberry pi | |
CN113052210B (en) | Rapid low-light target detection method based on convolutional neural network | |
CN112733950A (en) | Power equipment fault diagnosis method based on combination of image fusion and target detection | |
CN111861880B (en) | Image super-fusion method based on regional information enhancement and block self-attention | |
CN112686207B (en) | Urban street scene target detection method based on regional information enhancement | |
CN113222124B (en) | SAUNet + + network for image semantic segmentation and image semantic segmentation method | |
CN111429466A (en) | Space-based crowd counting and density estimation method based on multi-scale information fusion network | |
CN111369565A (en) | Digital pathological image segmentation and classification method based on graph convolution network | |
CN111951288A (en) | Skin cancer lesion segmentation method based on deep learning | |
CN116797787B (en) | Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network | |
WO2023138629A1 (en) | Encrypted image information obtaining device and method | |
CN116486074A (en) | Medical image segmentation method based on local and global context information coding | |
CN115100409A (en) | Video portrait segmentation algorithm based on twin network | |
CN113838102B (en) | Optical flow determining method and system based on anisotropic dense convolution | |
CN112906675B (en) | Method and system for detecting non-supervision human body key points in fixed scene | |
CN117409244A (en) | SCKConv multi-scale feature fusion enhanced low-illumination small target detection method | |
CN112232221A (en) | Method, system and program carrier for processing human image | |
CN116091793A (en) | Light field significance detection method based on optical flow fusion | |
Schirrmacher et al. | SR 2: Super-resolution with structure-aware reconstruction | |
CN115330655A (en) | Image fusion method and system based on self-attention mechanism | |
CN115731280A (en) | Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network | |
CN111950496B (en) | Mask person identity recognition method | |
US11790633B2 (en) | Image processing using coupled segmentation and edge learning | |
CN113192018A (en) | Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||