CN115100409A - Video portrait segmentation algorithm based on twin network - Google Patents
- Publication number: CN115100409A (application CN202210759308.9A)
- Authority: CN (China)
- Prior art keywords: module, video frame, network, encoder, features
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/454 — Biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/54 — Extraction of image or video features relating to texture
- G06V10/62 — Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
- G06V10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes
- G06V20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
The invention discloses a video portrait segmentation algorithm based on a twin (Siamese) network, relating to the technical field of image processing. The algorithm adopts a twin network structure whose basic components are a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module. The modules are built with the PyTorch deep learning framework, and the model learns a video processing method that predicts an accurate alpha matte for each frame of a video and extracts the portrait from a given image or video, thereby achieving high-resolution video portrait segmentation in complex scenes.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a video portrait segmentation algorithm based on a twin network.
Background
In computer vision, semantic image segmentation is an important research topic with wide application in many fields: for example, foreground segmentation of images can be used to change the background of a video, blending foreground characters into different scenes and enabling creative algorithm applications.
The purpose of portrait segmentation is to predict an accurate alpha matte that can be used to extract a person from a given image or video. It has wide applications such as photo editing and movie creation. A video portrait segmentation algorithm aims to predict the alpha matte of each video frame in complex scenes and thereby separate foreground from background. Existing real-time high-resolution video portrait segmentation algorithms obtain high-quality predictions only with the help of a green screen, and algorithms that work without a green screen have their own problems: for example, the data sets require trimaps, and obtaining trimaps is costly.
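To make the role of the alpha matte concrete, the following sketch shows how a predicted matte is used to composite an extracted foreground over a new background. The image sizes and pixel values are invented for illustration.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background using a per-pixel alpha matte.

    foreground, background: (H, W, 3) float arrays in [0, 1]
    alpha: (H, W) float array in [0, 1]; 1 = person, 0 = background
    """
    a = alpha[..., None]                      # broadcast over the RGB channels
    return a * foreground + (1.0 - a) * background

fg = np.ones((4, 4, 3)) * 0.8                 # toy "person" pixels
bg = np.zeros((4, 4, 3))                      # toy new background
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0                         # the person occupies the center
out = composite(fg, bg, alpha)
```

This is exactly why matte accuracy at portrait edges matters: any error in `alpha` leaks background pixels into the composited result.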
Therefore, a video portrait segmentation algorithm based on a twin network is urgently needed to solve the problems.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video portrait segmentation algorithm based on a twin network, which ensures high-precision segmentation of video portraits in environments with complex backgrounds, complex subject shapes and the like.
In order to achieve the above purpose, the invention designs and implements a video portrait segmentation algorithm based on a twin network, in which a high-resolution alpha matte is obtained through twin-network weight sharing, a recurrent neural network that captures temporal and spatial features, and joint upsampling. The technical scheme is as follows: a video portrait segmentation algorithm based on a twin network adopts a twin network structure whose basic components are a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and the method comprises the following steps:
step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
step S2: separating the preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame image into the Encoder network module and extracting coarse-grained features of the three-channel RGB video frame at five scales using a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module and recalibrating the features by learning the importance of each channel;
step S5: obtaining features at different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, fusing them, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features to obtain a high-resolution feature map;
step S6: obtaining features at three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
Further, the obtaining, by the video frame acquisition module, of a current video frame image from the video to be segmented and preprocessing it to obtain a preprocessed current video frame image includes: step S11: acquiring the current video frame image of the video to be segmented; step S12: preprocessing the acquired current video frame image.
Furthermore, the current video frame image is obtained from the video to be segmented through the video frame acquisition module and preprocessed, so as to obtain the preprocessed current video frame image.
Further, the three-channel RGB video frame image is input into the Encoder network module, and a MobileNetV3 network extracts its coarse-grained features at five scales: the lightweight MobileNetV3-Large network serves as the backbone, a four-stage encoder is built on the basis of the twin network, and coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through a down-sampling layer and the four-stage encoder.
Furthermore, the Encoder network module comprises a down-sampling layer and a four-stage encoder. The down-sampling layer uses bilinear interpolation to perform 4× down-sampling, yielding a feature map at 1/4 of the original image resolution. The four-stage encoder comprises a first-stage, a second-stage, a third-stage and a fourth-stage encoder; each stage adopts a weight-shared bottleneck structure that first applies a point-wise convolution group, then a depth-wise convolution group, connects to an SE (squeeze-and-excitation) module to learn channel weights, and finally passes the shallow features containing structural information to the deep features through a short link.
Further, connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module, which recalibrates the features at the channel level by learning the importance of each channel: a Squeeze operation converts the obtained coarse-grained features into a global feature by global averaging; an Excitation operation then learns the nonlinear relations among the channels from this global feature, obtains the weights of the different channels and recalibrates the features.
Further, features at different scales are obtained from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module and fused, capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features; a four-stage decoder corresponding to the Encoder module progressively restores and enlarges the high-level semantic information to obtain a high-resolution feature map.
Furthermore, the four-stage decoder performs multi-level feature fusion, reduces the number of channels and produces high-resolution feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution. The input of each decoder stage is merged with the output of the down-sampling process; after convolution and normalization, a ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
Further, obtaining features at three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image comprises: step S41: fusing the three features at different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; step S42: using separable convolution groups with different dilation rates to enlarge the receptive field, capturing context information, outputting four groups of feature maps of unchanged resolution, and concatenating them to fuse the multi-scale context information; step S43: generating an alpha matte with one channel from the fused multi-scale context information using a 3×3 2D convolution.
Furthermore, fusing the three features at different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module and outputting a feature map comprises: first performing a 3×3 2D convolution to unify the channel counts of the three inputs, then up-sampling them uniformly to the high-resolution feature scale, and finally outputting a feature map whose resolution matches the current video frame.
The above technical scheme shows that the invention has the following advantages:
compared with the prior art, the method can capture edge features, shallow texture features, time sequence, spatial features and other multi-level features, can supplement the time sequence, space and edge structural information of the alpha mask map of the current frame of the video, and realizes accurate prediction of the alpha mask, thereby segmenting the portrait from the background. The invention obtains accurate segmentation of portrait edge in various complex environments with low contrast between foreground and background, complex main body shape and the like, and has stronger robustness.
The method can perform high-precision video portrait segmentation on targets under complex scenes such as multiple targets, target shielding, tiny targets and fast target movement, and adopts a deep learning PyTorch frame to construct a learning video processing method of each model. The pre-training model of the video portrait segmentation algorithm based on the twin network is superior to other algorithms in terms of index results and visual effects on a test data set.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a video portrait segmentation algorithm based on a twin network according to the present invention.
FIG. 2 is a schematic diagram of the overall network structure of a video portrait segmentation algorithm based on a twin network according to the present invention.
FIG. 3 is a diagram illustrating a step of obtaining a video frame image according to the present invention.
FIG. 4 is a diagram illustrating a pre-processing procedure for a current video frame according to the present invention.
Fig. 5 is a schematic structural diagram of the Encoder network module bottleneck according to the present invention.
Fig. 6 is a detailed network structure diagram of an Encoder network module according to the present invention.
Fig. 7 is a schematic structural diagram of a Decoder module according to the present invention.
FIG. 8 is a schematic diagram of a JPU module according to the present invention.
FIG. 9 is a step diagram of a JPU module according to the present invention.
FIG. 10 is a diagram illustrating the effect of video image segmentation in different scenes by the video image segmentation algorithm based on the twin network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments and the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
The invention designs and implements a video portrait segmentation algorithm based on a twin network, which performs multi-scale feature fusion on the basis of the twin network, guides the network to capture edge features, shallow texture features, and temporal and spatial features, and can perform video portrait segmentation quickly and accurately.
Fig. 1 and 2 show a flow chart and a general network structure diagram of a video portrait segmentation algorithm based on a twin network.
According to the video portrait segmentation algorithm based on the twin network shown in figures 1 and 2, the algorithm aims to obtain a more accurate alpha matte by combining the semantic information, the temporal and spatial information, and the structural detail information of the video portrait. It adopts a twin network structure whose basic components are a video frame acquisition module, an RGB separation module, an Encoder network module, an SE (squeeze-and-excitation) module, a Decoder network module and a JPU (Joint Pyramid Upsampling) module, and specifically comprises the following steps:
step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
step S2: separating the preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame image into the Encoder network module and extracting coarse-grained features of the three-channel RGB video frame at five scales using a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module and recalibrating the features by learning the importance of each channel;
step S5: obtaining features at different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, fusing them, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features;
step S6: obtaining features at three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
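The six steps above can be sketched as a single PyTorch forward pass. This is a hypothetical stand-in, not the patented network: each stage is replaced by a trivial convolution purely to show the data flow and resolutions, and the ConvGRU and multi-scale branches are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PipelineSketch(nn.Module):
    """Toy stand-in for the twin-network pipeline (channel sizes invented)."""
    def __init__(self, ch=16):
        super().__init__()
        self.encoder = nn.Conv2d(3, ch, 3, stride=4, padding=1)   # S3: 1/4-res coarse features
        self.se = nn.Sequential(                                  # S4: channel re-weighting (SE)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.decoder = nn.Conv2d(ch, ch, 3, padding=1)            # S5: decode (ConvGRU omitted)
        self.head = nn.Conv2d(ch, 1, 3, padding=1)                # S6: 1-channel alpha matte

    def forward(self, frame_hr):
        feats = self.encoder(frame_hr)
        feats = feats * self.se(feats)                            # recalibrate channels
        feats = self.decoder(feats)
        feats = F.interpolate(feats, size=frame_hr.shape[-2:],    # back to frame resolution
                              mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(feats))

frame = torch.rand(1, 3, 64, 64)      # S1/S2: a preprocessed three-channel RGB frame
alpha = PipelineSketch()(frame)
```

The output has the frame's resolution and one channel, matching the alpha matte the algorithm predicts per frame.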
Fig. 3 and 4 show diagrams of steps for obtaining video frame images.
According to the acquisition of the video frame image shown in fig. 3, obtaining and preprocessing the current video frame image of the video to be segmented includes:
step S11: acquiring a current video frame image of a video to be segmented;
step S12: and preprocessing the acquired current video frame image.
According to the preprocessing of the current video frame shown in fig. 4, the preprocessing of the current video frame includes:
step S121: adjusting the size of the video to be segmented to be a preset size, wherein the preset size is the size of an input image required by the twin network;
step S122: normalizing the pixels of the image after the size adjustment;
step S123: and adjusting the order of the color channels of the normalized image according to a preset order.
Preprocessing the current video frame turns it into an image form suited to the twin network structure, which facilitates image input and accurate segmentation.
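Steps S121–S123 can be sketched as follows. The patent does not give the preset input size, normalization constants, or the channel orders involved, so the 512×512 size, the ImageNet-style mean/std, and the BGR-to-RGB reordering below are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

PRESET = (512, 512)                                           # assumed network input size
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)      # assumed normalization stats
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(frame_bgr_uint8: torch.Tensor) -> torch.Tensor:
    """frame_bgr_uint8: (H, W, 3) uint8 frame, assumed BGR channel order."""
    x = frame_bgr_uint8.float().permute(2, 0, 1) / 255.0      # HWC -> CHW, scale to [0, 1]
    x = F.interpolate(x[None], size=PRESET,                   # S121: resize to preset size
                      mode="bilinear", align_corners=False)[0]
    x = (x - MEAN) / STD                                      # S122: normalize pixels
    return x.flip(0)                                          # S123: reorder channels BGR -> RGB

frame = torch.randint(0, 256, (480, 640, 3), dtype=torch.uint8)
inp = preprocess(frame)
```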
Fig. 5 and fig. 6 show a schematic diagram of the Encoder network module bottleneck structure and a detailed network diagram.
According to the Encoder network module shown in fig. 5 and fig. 6, the Encoder network module has a twin network structure: the lightweight MobileNetV3-Large network, designed for semantic segmentation, is selected as the backbone, and a four-stage encoder is constructed on the basis of the twin network.
The three-channel RGB video frame image is input into the Encoder network module, and the MobileNetV3 network extracts its coarse-grained features at five scales; with MobileNetV3-Large as the backbone and the four-stage encoder built on the twin network, coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through the down-sampling layer and the four-stage encoder.
The down-sampling layer uses bilinear interpolation to perform 4× down-sampling, yielding a feature map at 1/4 of the original resolution of the current video frame. The four-stage encoder comprises a first-stage, a second-stage, a third-stage and a fourth-stage encoder; each stage adopts a weight-shared bottleneck structure that first applies a point-wise convolution group, then a depth-wise convolution group, connects to an SE (squeeze-and-excitation) module to learn channel weights, and finally passes the shallow features containing structural information to the deep features through a short link.
The four-stage encoder includes a first-stage encoder Encode_Blk1, a second-stage encoder Encode_Blk2, a third-stage encoder Encode_Blk3 and a fourth-stage encoder Encode_Blk4, all of which use weight-shared bottleneck structures. The bottleneck is an inverted residual structure: first a point-wise convolution group (1×1 2D convolution + batch normalization + activation layer), then a depth-wise convolution group (3×3 2D convolution + batch normalization + activation layer) connected to an SE module that learns channel weights, and finally a shortcut link that passes the shallow features containing structural detail information to the deep features.
Specifically, the first-stage encoder Encode_Blk1 contains two bottleneck blocks and yields a feature map at 1/8 of the original resolution; the second-stage encoder Encode_Blk2 contains two bottleneck blocks and yields a feature map at 1/16; the third-stage encoder Encode_Blk3 contains three bottleneck blocks and yields a feature map at 1/32; and the fourth-stage encoder Encode_Blk4 contains six bottleneck blocks and yields a feature map at 1/64, thus producing coarse-grained features of the three-channel RGB video frame at five scales.
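A hedged sketch of one such bottleneck block follows: a 1×1 point-wise expansion (conv + BN + activation), a 3×3 depth-wise convolution (conv + BN + activation), an SE re-weighting step, and a shortcut carrying the shallow features forward. The channel sizes, expansion ratio, and Hardswish activation are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Inverted-residual bottleneck with an SE step (illustrative sizes)."""
    def __init__(self, ch, expand=4):
        super().__init__()
        mid = ch * expand
        self.pointwise = nn.Sequential(                 # 1x1 2D conv + BN + activation
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.Hardswish())
        self.depthwise = nn.Sequential(                 # 3x3 depth-wise conv + BN + activation
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.Hardswish())
        self.se = nn.Sequential(                        # squeeze-and-excitation weighting
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, mid // 4, 1), nn.ReLU(),
            nn.Conv2d(mid // 4, mid, 1), nn.Sigmoid())
        self.project = nn.Conv2d(mid, ch, 1, bias=False)

    def forward(self, x):
        y = self.depthwise(self.pointwise(x))
        y = y * self.se(y)                              # re-weight channels
        return x + self.project(y)                      # shortcut: shallow -> deep features

x = torch.rand(1, 8, 32, 32)
y = Bottleneck(8)(x)
```

Stacking two, two, three and six such blocks with strides would give the 1/8 to 1/64 feature pyramid described above.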
Connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module, which recalibrates the features at the channel level by learning the importance of each channel: a Squeeze operation converts the obtained coarse-grained features into a global feature by global averaging; an Excitation operation then learns the nonlinear relations among the channels, obtains the weights of the different channels and recalibrates the features. The SE module thus yields a weight coefficient for each channel, giving the model stronger discrimination over the features of each channel.
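The Squeeze and Excitation operations described above can be written as a minimal block: Squeeze is global average pooling to one value per channel, and Excitation is a small bottlenecked MLP with a sigmoid that produces per-channel weights. The reduction ratio r=4 is an assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block (reduction ratio assumed)."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),   # learn nonlinear channel relations
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                     # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                # Squeeze: global average -> (N, C)
        w = self.fc(s).view(n, c, 1, 1)       # Excitation: per-channel weights in (0, 1)
        return x * w                          # recalibrate features channel-wise

x = torch.rand(2, 8, 16, 16)
y = SEBlock(8)(x)
```

Because the weights lie in (0, 1), the block can only attenuate channels, which is how it expresses each channel's learned importance.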
Fig. 7 shows a detailed network structure diagram of a Decoder network module.
According to the Decoder network module shown in fig. 7, the Decoder network module is a twin network in which weights are shared among the Decoder blocks, forming a pseudo-twin network with the Encoder network module. The Decoder network module obtains four features at different scales from the current video frame, the Encoder module, the downsampled current video frame (Image LR) and the ConvGRU recurrent neural network, and fuses them to recover the edge features lost in downsampling, shallow texture features, and temporal and spatial features.
To reduce the number of parameters and the computation, the four-stage decoder corresponding to the Encoder module splits its input along the channel dimension; the ConvGRU recurrent networks in each module compute on the split features, and the remainder is merged with the result through short links. The four-stage decoder performs multi-level feature fusion, reduces the number of channels and produces high-resolution feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution. The input of each decoder stage is merged with the output of the down-sampling process; after convolution and normalization, a ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
Specifically, each stage of the four-stage decoder comprises a 3×3 2D convolution + batch normalization + ReLU activation combination, a ConvGRU recurrent network, and 2× bilinear interpolation up-sampling. Like the up-sampling path of a conventional U-Net, the decoder input is merged with the output of the down-sampling process; after the 3×3 2D convolution + batch normalization + ReLU combination, the ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
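The ConvGRU cell that carries previous-frame information into the current frame can be sketched as below. The gate layout follows the standard GRU equations with convolutions in place of matrix products; the 3×3 kernel size is an assumption, and the patent's channel-splitting optimization is omitted.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell (kernel size and channel count illustrative)."""
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)      # update (z) and reset (r) gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)           # candidate hidden state

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde                          # blend old and new state

cell = ConvGRUCell(8)
h = torch.zeros(1, 8, 16, 16)                 # state carried from the "previous frame"
for _ in range(3):                            # feed three consecutive frame features
    h = cell(torch.rand(1, 8, 16, 16), h)
```

The hidden state keeps the feature-map shape, so it can be merged with the decoder's spatial features at each stage.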
Fig. 8 and 9 show a JPU module structure diagram and a step diagram.
According to the JPU module shown in fig. 8 and 9, the JPU module casts the extraction of the high-resolution feature map as joint upsampling: it uses the three features at different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module to efficiently generate a high-resolution output, given the corresponding low-resolution output (Image LR, the Decoder network module output) and the high-resolution image (Image HR) as guidance. The JPU module comprises the following steps:
step S41: performing feature fusion on three features with different scales obtained by a current video frame (Image HR), a current video frame downsampling (Image LR) and a Decoder network module, and outputting a feature graph;
step S42: using separable convolution groups with different cavity rates to increase the visual field, capturing context information, outputting four groups of feature graphs with unchanged resolution, and fusing multi-scale context information through merging (Concatenate);
step S43: an alpha Mongolian layout with the number of channels of 1 is generated by using a 3X 3 2D convolution.
In order to reduce the computational complexity and parameter count of the convolution operation, the ordinary standard convolution is replaced by separable convolution groups consisting of dilated convolutions with different dilation rates and point-wise convolutions. By decoupling channel correlation from spatial correlation, a standard convolution is replaced by a 3 × 3 depthwise convolution followed by a 1 × 1 point-wise convolution.
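The saving from this decoupling can be checked with a quick parameter count (bias terms omitted; the 64-channel example and the helper names are illustrative, not figures from the patent):

```python
def conv_params(c_in, c_out, k=3):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def separable_params(c_in, c_out, k=3):
    """Weight count of a k x k depthwise convolution plus a 1 x 1
    point-wise convolution (bias omitted)."""
    return k * k * c_in + c_in * c_out

# Example: 64 -> 64 channels with a 3 x 3 kernel.
standard = conv_params(64, 64)        # 9 * 64 * 64 = 36864
separable = separable_params(64, 64)  # 576 + 4096  = 4672
print(standard, separable, round(standard / separable, 1))  # roughly 7.9x fewer
```

In general the reduction factor is about k²·C_out / (k² + C_out), so for a 3 × 3 kernel with many output channels the separable form needs close to nine times fewer weights.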
Specifically, performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting the feature map, comprises the following steps: first, a 3 × 3 2D convolution operation unifies the channel numbers of the three input features; second, an upsampling operation uniformly restores them to the high-resolution feature scale; finally, a feature map whose resolution matches that of the current video frame is output.
FIG. 10 shows the portrait segmentation results of the video portrait segmentation algorithm based on a twin network in different scenes.
As can be seen from fig. 10, which shows the portrait segmentation results in different scenes, the method accurately segments portrait edges and separates the portrait in various complex environments such as low foreground-background contrast, complex backgrounds and complex subject shapes, and is highly robust. The method captures multi-level features including edge features, shallow texture features, and temporal and spatial features; it supplements the temporal, spatial and edge structure information of the alpha mask map of the current video frame, and accurately predicts the alpha mask map so as to segment the portrait from the background.
The method performs high-precision video portrait segmentation of targets in complex scenes such as multiple targets, target occlusion, tiny targets and fast target motion; each model is built with the PyTorch deep learning framework. On the test dataset, the pre-trained model of the video portrait segmentation algorithm based on a twin network outperforms other algorithms in both quantitative metrics and visual effect.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A video portrait segmentation algorithm based on a twin network, characterized in that a twin network structure is adopted; the basic structure of the algorithm comprises a video frame image acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module; and the algorithm comprises the following steps:
step S1: acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing the current video frame image to obtain a preprocessed current video frame image;
step S2: separating the obtained preprocessed video frame image into three-channel RGB video frame images in an RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame images into the Encoder network module, and extracting five multi-scale coarse-grained features of the three-channel RGB video frame images by adopting a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing feature recalibration at the feature-channel level by learning the importance of each channel;
step S5: obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing edge features lost in downsampling, shallow texture features, and temporal and spatial features;
step S6: obtaining, through the JPU module, three features of different scales from the current video frame, the downsampled current video frame and the Decoder network module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
2. The twin network based video portrait segmentation algorithm of claim 1, wherein acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing it to obtain the preprocessed current video frame image comprises: step S11: acquiring a current video frame image of the video to be segmented; step S12: preprocessing the acquired current video frame image.
3. The twin network based video portrait segmentation algorithm of claim 2, wherein the preprocessing of the current video frame comprises: step S121: adjusting the size of the video to be segmented to a preset size, wherein the preset size is the input image size required by the twin network; step S122: normalizing the pixels of the resized image; step S123: adjusting the order of the color channels of the normalized image to a preset order.
4. The twin network based video portrait segmentation algorithm of claim 1, wherein the three-channel RGB video frame images are input into the Encoder network module and a MobileNetV3 network is adopted to extract five multi-scale coarse-grained features of the three-channel RGB video frame images, comprising: using the lightweight network MobileNetV3-Large as the backbone, constructing a four-level encoder based on the twin network, and obtaining coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution through a downsampling layer and the four-level encoder.
5. The twin network based video portrait segmentation algorithm of claim 4, wherein the downsampling layer performs 4× downsampling by bilinear interpolation to obtain a feature map at 1/4 of the original image resolution; the four-level encoder comprises a first-stage encoder, a second-stage encoder, a third-stage encoder and a fourth-stage encoder, each stage adopting a bottleneck structure with multiple shared weights; each encoder stage first applies a point-wise convolution group, then a depthwise convolution group connected to an SE (Squeeze-and-Excitation) module that learns channel weights, and finally passes shallow features containing structured information to the deep features through skip connections.
6. The twin network based video portrait segmentation algorithm of claim 1, wherein connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing feature recalibration at the feature-channel level by learning the importance of each channel comprises: compressing the obtained coarse-grained features into a global feature through a Squeeze operation, the global feature being obtained by global average pooling; and performing an Excitation operation on the global feature obtained by the Squeeze operation, learning the nonlinear relations among the channels, obtaining weights for the different channels, and recalibrating the features.
7. The twin network based video portrait segmentation algorithm of claim 1, wherein obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing edge features lost in downsampling, shallow texture features, and temporal and spatial features, comprises gradually restoring and enlarging the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
8. The twin network based video portrait segmentation algorithm of claim 7, wherein the four-level decoder is configured for multi-level feature fusion, channel number reduction and high-resolution feature map acquisition, respectively obtaining feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution; the input of each decoder stage is merged with the corresponding output of the downsampling process, and after convolution and normalization, a ConvGRU recurrent network computes the output using information from the previous frame and the current frame.
9. The twin network based video portrait segmentation algorithm of claim 1, wherein efficiently generating the high-resolution feature map through the JPU module, given the corresponding low-resolution output and the high-resolution image, from the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, comprises the following steps: step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and fusing the multi-scale context information by concatenation; step S43: generating an alpha mask map with 1 channel from the fused multi-scale context information using a 3 × 3 2D convolution.
10. The twin network based video portrait segmentation algorithm of claim 9, wherein performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting the feature map, comprises the following steps: first, performing a 3 × 3 2D convolution operation to unify the channel numbers of the three input features; second, performing an upsampling operation to uniformly restore them to the high-resolution feature scale; and finally, outputting a feature map whose resolution matches that of the current video frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210759308.9A CN115100409B (en) | 2022-06-30 | 2022-06-30 | Video portrait segmentation algorithm based on twin network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115100409A true CN115100409A (en) | 2022-09-23 |
CN115100409B CN115100409B (en) | 2024-04-26 |
Family
ID=83295324
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210759308.9A Active CN115100409B (en) | 2022-06-30 | 2022-06-30 | Video portrait segmentation algorithm based on twin network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115100409B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116205928A (en) * | 2023-05-06 | 2023-06-02 | 南方医科大学珠江医院 | Image segmentation processing method, device and equipment for laparoscopic surgery video and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
WO2021088300A1 (en) * | 2019-11-09 | 2021-05-14 | 北京工业大学 | Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network |
CN114299944A (en) * | 2021-12-08 | 2022-04-08 | 天翼爱音乐文化科技有限公司 | Video processing method, system, device and storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021088300A1 (en) * | 2019-11-09 | 2021-05-14 | 北京工业大学 | Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network |
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
CN114299944A (en) * | 2021-12-08 | 2022-04-08 | 天翼爱音乐文化科技有限公司 | Video processing method, system, device and storage medium |
Non-Patent Citations (1)
Title |
---|
HONG QING; SONG QIAO; YANG CHENTAO; ZHANG PEI; CHANG LIANLI: "Image segmentation technology for mechanical parts based on intelligent vision", Machine Building & Automation, no. 05, 20 October 2020 (2020-10-20) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116205928A (en) * | 2023-05-06 | 2023-06-02 | 南方医科大学珠江医院 | Image segmentation processing method, device and equipment for laparoscopic surgery video and medium |
CN116205928B (en) * | 2023-05-06 | 2023-07-18 | 南方医科大学珠江医院 | Image segmentation processing method, device and equipment for laparoscopic surgery video and medium |
Also Published As
Publication number | Publication date |
---|---|
CN115100409B (en) | 2024-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110111366B (en) | End-to-end optical flow estimation method based on multistage loss | |
CN110287849B (en) | Lightweight depth network image target detection method suitable for raspberry pi | |
CN113052210B (en) | Rapid low-light target detection method based on convolutional neural network | |
CN112733950A (en) | Power equipment fault diagnosis method based on combination of image fusion and target detection | |
CN111861880B (en) | Image super-fusion method based on regional information enhancement and block self-attention | |
CN112686207B (en) | Urban street scene target detection method based on regional information enhancement | |
CN113222124B (en) | SAUNet + + network for image semantic segmentation and image semantic segmentation method | |
CN111429466A (en) | Space-based crowd counting and density estimation method based on multi-scale information fusion network | |
CN111369565A (en) | Digital pathological image segmentation and classification method based on graph convolution network | |
CN111951288A (en) | Skin cancer lesion segmentation method based on deep learning | |
CN116797787B (en) | Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network | |
WO2023138629A1 (en) | Encrypted image information obtaining device and method | |
CN116486074A (en) | Medical image segmentation method based on local and global context information coding | |
CN115100409A (en) | Video portrait segmentation algorithm based on twin network | |
CN113838102B (en) | Optical flow determining method and system based on anisotropic dense convolution | |
CN112906675B (en) | Method and system for detecting non-supervision human body key points in fixed scene | |
CN117409244A (en) | SCKConv multi-scale feature fusion enhanced low-illumination small target detection method | |
CN112232221A (en) | Method, system and program carrier for processing human image | |
CN116091793A (en) | Light field significance detection method based on optical flow fusion | |
Schirrmacher et al. | SR 2: Super-resolution with structure-aware reconstruction | |
CN115330655A (en) | Image fusion method and system based on self-attention mechanism | |
CN115731280A (en) | Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network | |
CN111950496B (en) | Mask person identity recognition method | |
US11790633B2 (en) | Image processing using coupled segmentation and edge learning | |
CN113192018A (en) | Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||