CN115100409A - Video portrait segmentation algorithm based on twin network - Google Patents

Video portrait segmentation algorithm based on twin network

Info

Publication number
CN115100409A
CN115100409A
Authority
CN
China
Prior art keywords
module
video frame
network
encoder
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210759308.9A
Other languages
Chinese (zh)
Other versions
CN115100409B (en)
Inventor
张笑钦
廖唐飞
赵丽
冯士杰
徐曰旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University filed Critical Wenzhou University
Priority to CN202210759308.9A priority Critical patent/CN115100409B/en
Publication of CN115100409A publication Critical patent/CN115100409A/en
Application granted granted Critical
Publication of CN115100409B publication Critical patent/CN115100409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/443: Local feature extraction by matching or filtering
    • G06V 10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451: Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/54: Extraction of image or video features relating to texture
    • G06V 10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks
    • G06V 20/00: Scenes; scene-specific elements
    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video portrait segmentation algorithm based on a twin (Siamese) network, relating to the technical field of image processing. The algorithm adopts a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module. The modules are built with the deep learning framework PyTorch, and the model learns a video processing method that predicts an accurate alpha matte for each frame of a video and extracts the portrait from a given image or video, so as to realize high-resolution video portrait segmentation in complex scenes.

Description

Video portrait segmentation algorithm based on twin network
Technical Field
The invention relates to the technical field of image processing, and in particular to a video portrait segmentation algorithm based on a twin (Siamese) network.
Background
In computer vision, image semantic segmentation is an important research topic that can be widely applied in many fields. For example, foreground segmentation of images can be used to change the background of a video and to composite foreground persons into different scenes, enabling creative applications of the algorithm.
The purpose of portrait matting is to predict an accurate alpha matte that can be used to extract a person from a given image or video. It has wide applications such as photo editing and movie creation. A video portrait segmentation algorithm aims to predict the alpha matte of each video frame in complex scenes and thereby separate the foreground from the background. Existing real-time high-resolution video portrait segmentation algorithms can obtain high-quality predictions only with the help of a green screen, and the algorithms that work without a green screen also have problems; for example, the datasets need trimaps, and obtaining trimaps is costly.
Therefore, a video portrait segmentation algorithm based on a twin network is urgently needed to solve the problems.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a video portrait segmentation algorithm based on a twin network, which ensures high-precision segmentation of video portraits in environments with complex backgrounds, complex subject shapes and the like.
In order to achieve this purpose, the invention designs and realizes a video portrait segmentation algorithm based on a twin network: a high-resolution alpha matte is obtained through twin-network weight sharing, a recurrent neural network that captures temporal and spatial features, and joint upsampling. The technical scheme is as follows: a video portrait segmentation algorithm based on a twin network adopts a twin network structure, the basic structure of which comprises a video frame image acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and the method comprises the following steps:
step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
step S2: separating the obtained preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame image into the Encoder network module, and extracting five scales of coarse-grained features from the three-channel RGB video frame image using a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module, and recalibrating the features by learning the importance of each channel;
step S5: obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features to obtain a high-resolution feature map;
step S6: obtaining three features of different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
Further, acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain the preprocessed current video frame image includes: step S11: acquiring a current video frame image of the video to be segmented; step S12: preprocessing the acquired current video frame image.
Furthermore, the current video frame image is acquired from the video to be segmented through the video frame acquisition module and preprocessed, so that the preprocessed current video frame image is obtained.
Further, the three-channel RGB video frame image is input into the Encoder network module and a MobileNetV3 network is adopted to extract five scales of coarse-grained features: the lightweight MobileNetV3-Large network is used as the backbone, a four-level encoder is built on the basis of the twin network, and coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through a downsampling layer and the four-level encoder.
Furthermore, the Encoder network module comprises a downsampling layer and a four-level encoder. The downsampling layer uses bilinear interpolation to perform 4x downsampling and obtain a feature map at 1/4 of the original image resolution. The four-level encoder comprises a first-level encoder, a second-level encoder, a third-level encoder and a fourth-level encoder; each level adopts several weight-shared bottleneck structures, first applies a pointwise convolution group, then a depthwise convolution group connected to an SE (Squeeze-and-Excitation) module that learns channel weights, and finally passes the shallow features containing structural information to the deep features through shortcut connections.
Further, connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module, which recalibrates the features at the channel level by learning the importance of each channel, including: a Squeeze operation that converts the obtained coarse-grained features into a global feature using global average pooling; and an Excitation operation on the global feature obtained by the Squeeze operation, which learns the nonlinear relations among the channels, obtains the weights of the different channels, and recalibrates the features.
Further, obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features includes gradually restoring and amplifying the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
Furthermore, the four-level decoder is used for multi-level feature fusion, channel reduction and obtaining a high-resolution feature map, producing feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution; the input of each decoder level is merged with the output of the downsampling process, and after convolution and normalization, a ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
Further, obtaining three features of different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image comprises the following steps: step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; step S42: using separable convolution groups with different dilation rates to enlarge the receptive field, capturing context information, outputting four groups of feature maps with unchanged resolution, and concatenating them to fuse the multi-scale context information; step S43: applying a 3×3 2D convolution to the fused multi-scale context information to generate an alpha matte with one channel.
Furthermore, performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module and outputting a feature map comprises the following steps: first, 3×3 2D convolutions unify the channel counts of the three inputs; second, upsampling restores them uniformly to the high-resolution feature scale; finally, a feature map whose resolution is consistent with the current video frame is output.
It can be seen from the above technical scheme that the invention has the following advantages:
Compared with the prior art, the method can capture multi-level features such as edge features, shallow texture features, and temporal and spatial features; it supplements the temporal, spatial and edge structural information of the alpha matte of the current video frame and realizes accurate prediction of the alpha matte, thereby segmenting the portrait from the background. The invention obtains accurate segmentation of the portrait edges in various complex environments such as low contrast between foreground and background and complex subject shapes, and has strong robustness.
The method can perform high-precision video portrait segmentation in complex scenes such as multiple targets, occluded targets, tiny targets and fast-moving targets, and the deep learning framework PyTorch is used to build each model and learn the video processing method. On the test dataset, the pre-trained model of the video portrait segmentation algorithm based on the twin network is superior to other algorithms in both quantitative results and visual effect.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a video portrait segmentation algorithm based on a twin network according to the present invention.
FIG. 2 is a schematic diagram of the overall network structure of a video portrait segmentation algorithm based on a twin network according to the present invention.
FIG. 3 is a diagram illustrating a step of obtaining a video frame image according to the present invention.
FIG. 4 is a diagram illustrating a pre-processing procedure for a current video frame according to the present invention.
Fig. 5 is a schematic structural diagram of the bottleneck in the Encoder network module according to the present invention.
Fig. 6 is a detailed network structure diagram of an Encoder network module according to the present invention.
Fig. 7 is a schematic structural diagram of a Decoder module according to the present invention.
FIG. 8 is a schematic diagram of a JPU module according to the present invention.
FIG. 9 is a step diagram of a JPU module according to the present invention.
FIG. 10 is a diagram illustrating the effect of video image segmentation in different scenes by the video image segmentation algorithm based on the twin network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments and the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
The invention designs and realizes a video portrait segmentation algorithm based on a twin network. It performs multi-scale feature fusion based on the twin network, guides the network to capture edge features, shallow texture features, and temporal and spatial features, and can perform video portrait segmentation rapidly and accurately.
Fig. 1 and 2 show a flow chart and a general network structure diagram of a video portrait segmentation algorithm based on a twin network.
According to the video portrait segmentation algorithm based on the twin network shown in Figs. 1 and 2, the aim is to obtain a more accurate alpha matte by constructing the video portrait segmentation algorithm and combining the semantic information, the temporal and spatial information, and the structural detail information of the video portrait. The video portrait segmentation algorithm based on the twin network adopts a twin network structure, the basic structure of which comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE (Squeeze-and-Excitation) module, a Decoder network module and a JPU (Joint Pyramid Upsampling) module, and the method specifically comprises the following steps:
step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
step S2: separating the obtained preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame image into the Encoder network module, and extracting five scales of coarse-grained features from the three-channel RGB video frame image using a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module, and recalibrating the features by learning the importance of each channel;
step S5: obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features;
step S6: obtaining three features of different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
Fig. 3 and 4 show diagrams of steps for obtaining video frame images.
According to the acquisition of the video frame image shown in Fig. 3, acquiring and preprocessing the current video frame image of the video to be segmented includes:
step S11: acquiring a current video frame image of a video to be segmented;
step S12: and preprocessing the acquired current video frame image.
According to the preprocessing of the current video frame shown in fig. 4, the preprocessing of the current video frame includes:
step S121: adjusting the size of the video to be segmented to be a preset size, wherein the preset size is the size of an input image required by the twin network;
step S122: normalizing the pixels of the image after the size adjustment;
step S123: and adjusting the order of the color channels of the normalized image according to a preset order.
Preprocessing converts the current video frame into an image form suited to the twin network structure, which facilitates feeding the image into the network and accurate segmentation.
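As an illustration of steps S121 to S123, a possible preprocessing routine is sketched below. It assumes frames are read with OpenCV in BGR order and that the network expects a 512x512 RGB input normalized to [0, 1]; the target size and the normalization are assumptions for illustration, since the patent only specifies a preset size and a preset channel order.

```python
import cv2
import numpy as np
import torch


def preprocess_frame(frame_bgr: np.ndarray, size: int = 512) -> torch.Tensor:
    # S121: resize the frame to the preset input size required by the twin network.
    resized = cv2.resize(frame_bgr, (size, size), interpolation=cv2.INTER_LINEAR)
    # S122: normalize pixel values to [0, 1] (assumed normalization scheme).
    normalized = resized.astype(np.float32) / 255.0
    # S123: reorder color channels from OpenCV's BGR to the RGB order expected by the model.
    rgb = normalized[:, :, ::-1].copy()
    # HWC -> CHW and add a batch dimension: (1, 3, H, W).
    return torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)
```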
Figs. 5 and 6 show the schematic diagram of the bottleneck structure of the Encoder network module and its detailed network diagram.
According to the Encoder network module shown in Figs. 5 and 6, the Encoder network module has a twin network structure: the lightweight MobileNetV3-Large network designed for semantic segmentation is selected as the backbone, and a four-level encoder is constructed based on the twin network.
The three-channel RGB video frame image is input into the Encoder network module, and a MobileNetV3 network is used to extract five scales of coarse-grained features: the lightweight MobileNetV3-Large network serves as the backbone, the four-level encoder is constructed on the basis of the twin network, and coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through a downsampling layer and the four-level encoder.
The downsampling layer uses bilinear interpolation to perform 4x downsampling and obtain a feature map at 1/4 of the original resolution of the current video frame. The four-level encoder comprises a first-level encoder, a second-level encoder, a third-level encoder and a fourth-level encoder; each level adopts several weight-shared bottleneck structures, first applies a pointwise convolution group, then a depthwise convolution group connected to an SE module that learns channel weights, and finally passes the shallow features containing structural information to the deep features through shortcut connections.
The four-level encoder includes a first-level encoder Encode_Blk1, a second-level encoder Encode_Blk2, a third-level encoder Encode_Blk3 and a fourth-level encoder Encode_Blk4, all of which use weight-shared bottleneck structures. The bottleneck is an inverted residual structure: first a pointwise convolution group (1×1 2D convolution + batch normalization + activation layer), then a depthwise convolution group (3×3 2D convolution + batch normalization + activation layer) connected to an SE module that learns channel weights, and finally the shallow features containing structural detail information are passed to the deep features through a shortcut connection.
Specifically, the first-level encoder Encode_Blk1 includes two bottleneck blocks and obtains a feature map at 1/8 of the original resolution, the second-level encoder Encode_Blk2 includes two bottleneck blocks and obtains a feature map at 1/16 of the original resolution, the third-level encoder Encode_Blk3 includes three bottleneck blocks and obtains a feature map at 1/32 of the original resolution, and the fourth-level encoder Encode_Blk4 includes six bottleneck blocks and obtains a feature map at 1/64 of the original resolution, so that coarse-grained features at five scales of the three-channel RGB video frame resolution are obtained.
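The following is a minimal PyTorch sketch of one such bottleneck (inverted residual) block: pointwise convolution group, depthwise convolution group, SE channel attention, and shortcut connection. The channel widths, the expansion ratio and the Hardswish activation are illustrative assumptions and do not reproduce the exact MobileNetV3-Large configuration used in the patent.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, expand_ch: int, stride: int = 1):
        super().__init__()
        # Pointwise (1x1) convolution group: 1x1 conv + batch normalization + activation.
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, expand_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(expand_ch),
            nn.Hardswish(),
        )
        # Depthwise (3x3) convolution group: 3x3 depthwise conv + batch normalization + activation.
        self.depthwise = nn.Sequential(
            nn.Conv2d(expand_ch, expand_ch, kernel_size=3, stride=stride,
                      padding=1, groups=expand_ch, bias=False),
            nn.BatchNorm2d(expand_ch),
            nn.Hardswish(),
        )
        # SE channel attention that learns per-channel weights (see the SE sketch below).
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(expand_ch, expand_ch // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(expand_ch // 4, expand_ch, kernel_size=1),
            nn.Sigmoid(),
        )
        # Linear projection back to the output channel count.
        self.project = nn.Sequential(
            nn.Conv2d(expand_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # The shortcut passes shallow, structure-rich features to the deep features
        # when the spatial size and channel count are unchanged.
        self.use_shortcut = (stride == 1 and in_ch == out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pointwise(x)
        y = self.depthwise(y)
        y = y * self.se(y)          # re-weight channels with the learned SE weights
        y = self.project(y)
        return x + y if self.use_shortcut else y
```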
Connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module, which recalibrates the features at the channel level by learning the importance of each channel: a Squeeze operation converts the obtained coarse-grained features into a global feature using global average pooling, and an Excitation operation on this global feature learns the nonlinear relations among the channels, obtains the weights of the different channels, and recalibrates the features. The SE module thus yields a weight coefficient for each channel, giving the model a stronger ability to discriminate the features of each channel.
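A minimal sketch of the Squeeze-and-Excitation recalibration described above is given below; the fully connected layers and the reduction ratio of 4 are illustrative assumptions. Placed between the encoder and the decoder, such a block hands the decoder channel-reweighted coarse-grained features.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Excitation: two fully connected layers + sigmoid produce per-channel weights.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))                  # Squeeze: global average pooling -> (B, C)
        weights = self.fc(squeezed).view(b, c, 1, 1)   # Excitation: channel weights in (0, 1)
        return x * weights                             # recalibrate the features channel-wise
```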
Fig. 7 shows a detailed network structure diagram of a Decoder network module.
According to the Decoder network module shown in Fig. 7, the Decoder network module is a twin network: weights are shared among the Decoder blocks, and together with the Encoder network module it forms a pseudo-twin network. The Decoder network module obtains features of four different scales from the current video frame, the Encoder module, the downsampled current video frame (Image LR) and the ConvGRU recurrent neural network, and fuses them to recover the edge features lost in downsampling, shallow texture features, and temporal and spatial features.
In order to reduce the number of parameters and the amount of computation, the four-level decoder corresponding to the Encoder module splits its input along the channel dimension: the ConvGRU recurrent network in each block computes on the split features, and the remainder is merged with the result through shortcut connections. The four-level decoder is used for multi-level feature fusion, channel reduction and obtaining high-resolution feature maps, producing feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution; the input of each decoder level is merged with the output of the downsampling process, and after convolution and normalization, a ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
Specifically, each level of the four-level decoder comprises a 3×3 2D convolution + batch normalization + ReLU activation combination, a ConvGRU recurrent network and 2x bilinear interpolation upsampling. Similar to the upsampling path of a conventional U-Net structure, the input of the decoder is merged with the output of the downsampling process, and after the 3×3 2D convolution + batch normalization + ReLU combination, the ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
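The sketch below illustrates one such decoder level: skip fusion with a 3x3 convolution + batch normalization + ReLU, a ConvGRU cell that carries the previous-frame state, a channel split so that only part of the features pass through the ConvGRU, and 2x bilinear upsampling. The half-and-half split ratio and the channel widths are assumptions, not the exact patented configuration.

```python
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGRUCell(nn.Module):
    """Convolutional GRU: the hidden state keeps temporal (previous-frame) information."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: Optional[torch.Tensor] = None) -> torch.Tensor:
        if h is None:
            h = torch.zeros_like(x)                    # first frame: empty temporal state
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1.0 - z) * h + z * h_tilde


class DecoderBlock(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(                     # 3x3 conv + BN + ReLU on the merged input
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Only half the channels pass through the ConvGRU (to cut parameters and compute);
        # the other half bypasses it and is re-merged, as described above.
        self.gru = ConvGRUCell(out_ch // 2)

    def forward(self, x: torch.Tensor, skip: torch.Tensor,
                h: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]:
        x = self.fuse(torch.cat([x, skip], dim=1))     # merge with the downsampling-path output
        a, b = x.chunk(2, dim=1)                       # split along the channel dimension
        b = self.gru(b, h)                             # temporal fusion with the previous frame
        x = torch.cat([a, b], dim=1)
        x = F.interpolate(x, scale_factor=2.0, mode="bilinear", align_corners=False)
        return x, b                                    # b is the hidden state for the next frame
```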
Fig. 8 and 9 show a JPU module structure diagram and a step diagram.
According to the JPU module shown in Figs. 8 and 9, the JPU module turns the extraction of the high-resolution feature map into joint upsampling: using the three features of different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module, it efficiently generates a high-resolution result given the guidance of the corresponding low-resolution output (Image LR and the Decoder network module output) and the high-resolution image (Image HR). The JPU module comprises the following steps:
step S41: performing feature fusion on the three features of different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module, and outputting a feature map;
step S42: using separable convolution groups with different dilation rates to enlarge the receptive field, capturing context information, outputting four groups of feature maps with unchanged resolution, and fusing the multi-scale context information through concatenation;
step S43: generating an alpha matte with one channel using a 3×3 2D convolution.
In order to reduce the computational complexity and the number of parameters of the convolution operation, the ordinary standard convolution is replaced by a separable convolution group consisting of dilated convolutions with different dilation rates and pointwise convolutions: by decoupling the channel correlation and the spatial correlation, the standard convolution is replaced by a 3×3 depthwise convolution and a 1×1 pointwise convolution.
Specifically, performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module and outputting a feature map comprises the following steps: first, 3×3 2D convolutions unify the channel counts of the three inputs; second, upsampling restores them uniformly to the high-resolution feature scale; finally, a feature map whose resolution is consistent with the current video frame is output.
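A minimal sketch of the JPU head covering steps S41 to S43 follows. The channel width, the assumed decoder output channel count, the dilation rates (1, 2, 4, 8) and the final sigmoid that keeps the matte in [0, 1] are illustrative assumptions, not values taken from the patent.

```python
from typing import Sequence

import torch
import torch.nn as nn
import torch.nn.functional as F


class SeparableConv(nn.Module):
    """3x3 depthwise (dilated) conv + 1x1 pointwise conv, replacing a standard convolution."""

    def __init__(self, in_ch: int, out_ch: int, dilation: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                      groups=in_ch, bias=False),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


class JPU(nn.Module):
    def __init__(self, in_chs: Sequence[int] = (3, 3, 16), width: int = 16,
                 dilations: Sequence[int] = (1, 2, 4, 8)):
        super().__init__()
        # S41: one 3x3 convolution per input to unify the channel counts.
        self.align = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for c in in_chs])
        # S42: separable convolutions with different dilation rates enlarge the receptive field.
        self.dilated = nn.ModuleList([SeparableConv(3 * width, width, d) for d in dilations])
        # S43: 3x3 convolution producing the one-channel alpha matte.
        self.head = nn.Conv2d(len(dilations) * width, 1, 3, padding=1)

    def forward(self, image_hr: torch.Tensor, image_lr: torch.Tensor,
                dec_feat: torch.Tensor) -> torch.Tensor:
        h, w = image_hr.shape[-2:]
        feats = []
        for branch, x in zip(self.align, (image_hr, image_lr, dec_feat)):
            x = branch(x)
            if x.shape[-2:] != (h, w):                 # restore everything to the high-res scale
                x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
            feats.append(x)
        fused = torch.cat(feats, dim=1)                                # S41: fused feature map
        context = torch.cat([m(fused) for m in self.dilated], dim=1)   # S42: multi-scale context
        return torch.sigmoid(self.head(context))                      # S43: alpha matte (sigmoid assumed)
```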
FIG. 10 shows a portrait segmentation effect diagram of a video portrait segmentation algorithm based on a twin network in different scenes.
The portrait segmentation results in different scenes shown in Fig. 10 demonstrate that the method can accurately segment the portrait edges, and thus the portraits, in various complex environments such as low foreground-background contrast, complex backgrounds and complex subject shapes, and that it has strong robustness. The method captures multi-level features such as edge features, shallow texture features, and temporal and spatial features, supplements the temporal, spatial and edge structural information of the alpha matte of the current video frame, and realizes accurate prediction of the alpha matte so as to segment the portrait from the background.
The method can perform high-precision video portrait segmentation in complex scenes such as multiple targets, occluded targets, tiny targets and fast-moving targets, and the deep learning framework PyTorch is used to build each model and learn the video processing method. On the test dataset, the pre-trained model of the video portrait segmentation algorithm based on the twin network is superior to other algorithms in both quantitative results and visual effect.
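To make the per-frame data flow of steps S1 to S6 concrete, the following is a minimal PyTorch sketch of how modules like the ones sketched above could be wired together for inference. The encoder, decoder and jpu arguments, their interfaces, and the 1/4 guidance scale are assumptions for illustration and do not reproduce the exact patented implementation.

```python
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_frame(frame_hr: torch.Tensor, encoder: nn.Module, decoder: nn.Module,
                  jpu: nn.Module, hidden: Optional[list] = None
                  ) -> Tuple[torch.Tensor, list]:
    """frame_hr: preprocessed current video frame, a (B, 3, H, W) RGB tensor (steps S1/S2)."""
    # Steps S3/S4: the weight-shared encoder (with its internal 4x downsampling layer and
    # SE recalibration) extracts the multi-scale coarse-grained features.
    enc_feats = encoder(frame_hr)
    # Low-resolution guidance image (Image LR) used by the decoder and the JPU head.
    frame_lr = F.interpolate(frame_hr, scale_factor=0.25, mode="bilinear", align_corners=False)
    # Step S5: the decoder fuses encoder features, Image LR and the ConvGRU temporal state.
    dec_feat, hidden = decoder(enc_feats, frame_lr, hidden)
    # Step S6: the JPU head combines Image HR, Image LR and the decoder output into a
    # one-channel alpha matte at the input resolution.
    alpha = jpu(frame_hr, frame_lr, dec_feat)
    return alpha, hidden
```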
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video portrait segmentation algorithm based on a twin network, characterized in that a twin network structure is adopted, the basic structure of which comprises a video frame image acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and the algorithm comprises the following steps:
step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
step S2: separating the obtained preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
step S3: inputting the three-channel RGB video frame image into the Encoder network module, and extracting five scales of coarse-grained features from the three-channel RGB video frame image using a MobileNetV3 network;
step S4: connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing channel-level feature recalibration by learning the importance of each channel;
step S5: obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features;
step S6: obtaining three features of different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
2. The twin network based video portrait segmentation algorithm of claim 1, wherein acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain the preprocessed current video frame image comprises: step S11: acquiring a current video frame image of the video to be segmented; step S12: preprocessing the acquired current video frame image.
3. The twin network based video portrait segmentation algorithm of claim 2, wherein the preprocessing of the current video frame comprises: step S121: adjusting the size of the video frame to a preset size, wherein the preset size is the input image size required by the twin network; step S122: normalizing the pixels of the resized image; step S123: adjusting the order of the color channels of the normalized image according to a preset order.
4. The twin network based video portrait segmentation algorithm of claim 1, wherein inputting the three-channel RGB video frame image into the Encoder network module and extracting five scales of coarse-grained features using a MobileNetV3 network comprises: using the lightweight MobileNetV3-Large network as the backbone, constructing a four-level encoder based on the twin network, and obtaining coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution through a downsampling layer and the four-level encoder.
5. The twin network based video portrait segmentation algorithm of claim 4, wherein the downsampling layer performs 4x downsampling using bilinear interpolation to obtain a feature map at 1/4 of the original image resolution; the four-level encoder comprises a first-level encoder, a second-level encoder, a third-level encoder and a fourth-level encoder, each level adopting several weight-shared bottleneck structures; each level first applies a pointwise convolution group, then a depthwise convolution group connected to an SE (Squeeze-and-Excitation) module that learns channel weights, and finally passes the shallow features containing structural information to the deep features through shortcut connections.
6. The twin network based video portrait segmentation algorithm of claim 1, wherein connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing channel-level feature recalibration by learning the importance of each channel comprises: converting the obtained coarse-grained features into a global feature through a Squeeze operation, the global feature being obtained by global average pooling; and performing an Excitation operation on the global feature obtained by the Squeeze operation, learning the nonlinear relations among the channels, obtaining the weights of the different channels, and recalibrating the features.
7. The twin network based video portrait segmentation algorithm of claim 1, wherein obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features comprises gradually restoring and amplifying the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
8. The twin network based video portrait segmentation algorithm of claim 7, wherein the four-level decoder is used for multi-level feature fusion, channel reduction and obtaining high-resolution feature maps, respectively producing feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution; the input of each decoder level is merged with the output of the downsampling process, and after convolution and normalization, a ConvGRU recurrent network computes the output using the previous-frame and current-frame information.
9. The twin network based video portrait segmentation algorithm of claim 1, wherein obtaining three features of different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image comprises the following steps: step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; step S42: using separable convolution groups with different dilation rates to enlarge the receptive field, capturing context information, outputting four groups of feature maps with unchanged resolution, and concatenating them to fuse the multi-scale context information; step S43: applying a 3×3 2D convolution to the fused multi-scale context information to generate an alpha matte with one channel.
10. The twin network based video portrait segmentation algorithm of claim 9, wherein performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module and outputting a feature map comprises the following steps: first, 3×3 2D convolutions unify the channel counts of the three inputs; second, upsampling restores them uniformly to the high-resolution feature scale; finally, a feature map whose resolution is consistent with the current video frame is output.
CN202210759308.9A 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network Active CN115100409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759308.9A CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759308.9A CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Publications (2)

Publication Number Publication Date
CN115100409A true CN115100409A (en) 2022-09-23
CN115100409B CN115100409B (en) 2024-04-26

Family

ID=83295324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759308.9A Active CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Country Status (1)

Country Link
CN (1) CN115100409B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205928A (en) * 2023-05-06 2023-06-02 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hong Qing; Song Qiao; Yang Chentao; Zhang Pei; Chang Lianli: "Image segmentation technology for mechanical parts based on intelligent vision" (基于智能视觉的机械零件图像分割技术), Machinery Manufacturing and Automation (机械制造与自动化), no. 05, 20 October 2020 (2020-10-20) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205928A (en) * 2023-05-06 2023-06-02 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium
CN116205928B (en) * 2023-05-06 2023-07-18 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium

Also Published As

Publication number Publication date
CN115100409B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN110287849B (en) Lightweight depth network image target detection method suitable for raspberry pi
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112733950A (en) Power equipment fault diagnosis method based on combination of image fusion and target detection
CN111861880B (en) Image super-fusion method based on regional information enhancement and block self-attention
CN112686207B (en) Urban street scene target detection method based on regional information enhancement
CN113222124B (en) SAUNet + + network for image semantic segmentation and image semantic segmentation method
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
CN111951288A (en) Skin cancer lesion segmentation method based on deep learning
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
WO2023138629A1 (en) Encrypted image information obtaining device and method
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN115100409A (en) Video portrait segmentation algorithm based on twin network
CN113838102B (en) Optical flow determining method and system based on anisotropic dense convolution
CN112906675B (en) Method and system for detecting non-supervision human body key points in fixed scene
CN117409244A (en) SCKConv multi-scale feature fusion enhanced low-illumination small target detection method
CN112232221A (en) Method, system and program carrier for processing human image
CN116091793A (en) Light field significance detection method based on optical flow fusion
Schirrmacher et al. SR 2: Super-resolution with structure-aware reconstruction
CN115330655A (en) Image fusion method and system based on self-attention mechanism
CN115731280A (en) Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN111950496B (en) Mask person identity recognition method
US11790633B2 (en) Image processing using coupled segmentation and edge learning
CN113192018A (en) Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant