CN116797975A - Video segmentation method, device, computer equipment and storage medium - Google Patents

Video segmentation method, device, computer equipment and storage medium

Info

Publication number
CN116797975A
CN116797975A (application number CN202310773090.7A)
Authority
CN
China
Prior art keywords
feature
video
segmentation
video frame
segmented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310773090.7A
Other languages
Chinese (zh)
Inventor
杨志雄
杨延展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202310773090.7A priority Critical patent/CN116797975A/en
Publication of CN116797975A publication Critical patent/CN116797975A/en
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure provides a video segmentation method, apparatus, computer device, and storage medium, including: acquiring a video to be segmented and segmentation prompt information corresponding to the video to be segmented, where the segmentation prompt information characterizes the segmentation requirement for the video to be segmented; encoding the video frames of the video to be segmented with an image encoder to obtain a first encoding feature, and encoding the segmentation prompt information with a prompt information encoder to obtain a second encoding feature; inputting the first encoding feature and the second encoding feature into a decoder, and performing temporal feature extraction to obtain a first feature and spatial feature extraction to obtain a second feature; and determining a segmentation mask image corresponding to each video frame based on the first feature and the second feature, and segmenting the video frame based on the segmentation mask image.

Description

Video segmentation method, device, computer equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a video segmentation method, a video segmentation device, computer equipment and a storage medium.
Background
With the development of neural networks, more and more networks are used for image segmentation, and the accuracy of image segmentation keeps improving. Although a video is composed of multiple image frames, the segmentation requirements of video differ markedly from those of images: image segmentation only needs to consider the relations among the objects within a single image, whereas video segmentation must also consider the associations among the image frames. If a network designed for image segmentation is applied directly to video segmentation, the segmentation effect is poor.
Disclosure of Invention
The embodiment of the disclosure at least provides a video segmentation method, a video segmentation device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a video segmentation method, including:
acquiring a video to be segmented and segmentation prompt information corresponding to the video to be segmented; the segmentation prompt information is used for representing the segmentation requirement of the video to be segmented;
encoding the video frames of the video to be segmented based on an image encoder to obtain first encoding features; and encoding the segmentation prompt information based on a prompt information encoder to obtain a second encoding characteristic;
inputting the first coding feature and the second coding feature into a decoder, and performing temporal feature extraction to obtain a first feature and spatial feature extraction to obtain a second feature;
and determining a segmentation mask image corresponding to the video frame based on the first feature and the second feature, and carrying out segmentation processing on the video frame based on the segmentation mask image.
In a possible implementation manner, the video frame of the video to be segmented is a video frame obtained after sampling and frame extracting; the image-based encoder encodes the video frames of the video to be segmented to obtain first encoding features, including:
for any video frame, dividing the video frame into a plurality of image blocks;
determining an embedded characterization vector of each image block, and determining the embedded characterization vector of the video frame based on the embedded characterization vector of each image block;
the first encoding feature is determined based on the embedded characterization vector for each video frame.
In a possible implementation manner, the inputting the first coding feature and the second coding feature into a decoder includes:
splicing the first coding feature and the second coding feature to obtain a third coding feature;
The third encoding feature is input into the decoder.
In a possible implementation manner, the decoder is configured to perform the spatial feature extraction by:
and inputting the third coding feature into an attention mechanism model containing a fine-tuning structural layer to obtain the second feature.
In a possible implementation manner, the inputting the third coding feature into an attention mechanism model including a fine tuning structure layer to obtain the second feature includes:
normalizing the third coding feature based on a first normalization layer to obtain a first normalization feature;
inputting the first normalization feature to the fine adjustment structure layer to obtain a fine adjustment feature; inputting the first normalized feature to a multi-head self-attention module for feature extraction to obtain an intermediate feature;
fusing the first normalization feature, the intermediate feature and the fine tuning feature to obtain a first fusion feature;
normalizing the first fusion feature based on a second normalization layer to obtain a second normalization feature;
and inputting the second normalization feature into a multi-layer perceptron to obtain the second feature.
In a possible implementation, the decoder is configured to perform the temporal feature extraction by:
performing first channel adjustment on the third coding feature to obtain an adjustment feature;
inputting the adjustment feature into an attention mechanism model comprising a fine adjustment structure layer to obtain a third feature;
and performing second channel adjustment on the third characteristic to obtain the first characteristic.
In a possible implementation manner, the determining the segmentation mask image corresponding to the video frame based on the first feature and the second feature includes:
fusing the first feature and the second feature to obtain a second fused feature;
and determining a segmentation mask image corresponding to the video frame based on the second fusion feature.
In a second aspect, an embodiment of the present disclosure further provides a video segmentation apparatus, including:
the acquisition module is used for acquiring the video to be segmented and segmentation prompt information corresponding to the video to be segmented; the segmentation prompt information is used for representing the segmentation requirement of the video to be segmented;
the coding module is used for coding the video frames of the video to be segmented based on an image coder to obtain a first coding characteristic; and encoding the segmentation prompt information based on a prompt information encoder to obtain a second encoding characteristic;
the decoding module is used for inputting the first coding feature and the second coding feature into a decoder, and performing temporal feature extraction to obtain a first feature and spatial feature extraction to obtain a second feature;
and the determining module is used for determining a segmentation mask image corresponding to the video frame based on the first feature and the second feature, and carrying out segmentation processing on the video frame based on the segmentation mask image.
In a third aspect, embodiments of the present disclosure further provide a computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect, or any of the possible implementations of the first aspect.
In a fourth aspect, the presently disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect, or any of the possible implementations of the first aspect.
According to the video segmentation method, apparatus, computer device, and storage medium provided by the embodiments of the present disclosure, when a video to be segmented is segmented, after the video frames and the segmentation prompt information are encoded, feature extraction can be performed from the spatial domain and the temporal domain respectively; a segmentation mask image corresponding to each video frame is then determined based on the first feature obtained by temporal feature extraction and the second feature obtained by spatial feature extraction, and segmentation is performed based on the segmentation mask image. In this way, the association among the video frames in the temporal domain is taken into account during feature extraction, so the segmentation mask image obtained in this way yields a more accurate segmentation result when the video frames are segmented.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below, which are incorporated in and constitute a part of the specification, these drawings showing embodiments consistent with the present disclosure and together with the description serve to illustrate the technical solutions of the present disclosure. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope, for the person of ordinary skill in the art may admit to other equally relevant drawings without inventive effort.
FIG. 1 illustrates a schematic architecture of a SAM model provided by an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a video segmentation method provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of the internal structure of a LoRA provided by an embodiment of the present disclosure;
FIG. 4a illustrates the internal structure of a spatial feature extraction module provided by an embodiment of the present disclosure;
FIG. 4b shows a schematic diagram of the decoder internal structure of the SAM model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing an internal structure of a temporal feature extraction module according to an embodiment of the present disclosure;
FIG. 6 illustrates an overall architecture diagram of a video segmentation model provided by an embodiment of the present disclosure;
fig. 7 shows a schematic architecture of a video segmentation apparatus according to an embodiment of the disclosure;
fig. 8 shows a schematic structural diagram of a computer device according to an embodiment of the disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
Video segmentation in the present disclosure may be understood as segmenting the video frames in a video; however, unlike image segmentation, video frame segmentation needs to consider the association relationship between frames. For example, a dynamic object may be occluded in part of the video frames; if each video frame is segmented separately without considering the associations between frames, the segmentation results for the dynamic object may differ across video frames.
For example, if the video contains a cat that is moving, the cat may be segmented as a single entity in some video frames, while in other video frames the cat may be occluded by other objects such as a table and be segmented together with those objects as one entity.
To avoid this, the related art proposes performing segmentation from both the temporal domain and the spatial domain based on optical flow information (e.g., a two-stream network). However, this generally requires computing optical flow information between frames, and the computation of optical flow information is complex, so the computation speed and accuracy of this approach are low.
Based on this, in the video segmentation method, apparatus, computer device, and storage medium provided by the embodiments of the present disclosure, when a video to be segmented is segmented, after the video frames and the segmentation prompt information are encoded, feature extraction can be performed from the spatial domain and the temporal domain respectively; a segmentation mask image corresponding to each video frame is then determined based on the first feature obtained by temporal feature extraction and the second feature obtained by spatial feature extraction, and segmentation is performed based on the segmentation mask image. In this way, the association among the video frames in the temporal domain is taken into account during feature extraction, so the segmentation mask image obtained in this way yields a more accurate segmentation result when the video frames are segmented; in addition, in this method, feature extraction is performed through prompt learning rather than by extracting optical flow information, so both the computation speed and accuracy are higher.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" is used herein to describe only one relationship, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly prompt the user that the operation it is requesting to perform will require personal information to be obtained and used with the user. Thus, the user can autonomously select whether to provide personal information to software or hardware such as an electronic device, an application program, a server or a storage medium for executing the operation of the technical scheme of the present disclosure according to the prompt information.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
For the sake of understanding the present embodiment, first, a detailed description will be given of a video segmentation method disclosed in the embodiments of the present disclosure, where an execution body of the video segmentation method provided in the embodiments of the present disclosure is generally a server.
The video segmentation model described in the present disclosure may be built as an improvement on the Segment Anything Model (SAM). As shown in fig. 1, the input of the SAM model includes an image to be segmented and segmentation prompt information: the image to be segmented is input to an image encoder, the segmentation prompt information is input to a prompt encoder, and the two encoded results are then input to a decoder for decoding to obtain a segmentation mask image, after which the image to be segmented is segmented based on the segmentation mask image.
As can be seen from the above structure, the SAM model only considers spatial-domain characteristics and decodes only in the spatial domain when decoding; therefore, if the SAM model is applied directly to video segmentation, the segmentation effect is poor.
The video segmentation model of the present disclosure is an improvement to the decoding module of the SAM model, and the structure diagram of the decoding module will be described below.
Referring to fig. 2, a flowchart of a video segmentation method according to an embodiment of the disclosure is shown, where the method includes steps 201 to 204, where:
step 201, obtaining a video to be segmented and segmentation prompt information corresponding to the video to be segmented; the segmentation prompt information is used for representing the segmentation requirement of the video to be segmented.
Step 202, encoding the video frames of the video to be segmented based on an image encoder to obtain first encoding characteristics; and encoding the segmentation prompt information based on the prompt information encoder to obtain a second encoding characteristic.
Step 203, inputting the first coding feature and the second coding feature into a decoder, and respectively performing temporal feature extraction to obtain a first feature and spatial feature extraction to obtain a second feature.
Step 204, determining a segmentation mask image corresponding to the video frame based on the first feature and the second feature, and performing segmentation processing on the video frame based on the segmentation mask image.
The following is a detailed description of the above steps.
For step 201,
The video to be segmented may be a video uploaded by the user or a video selected by the user from local storage. Optionally, the length of the video to be segmented may be fixed, for example, 10 s.
The segmentation prompt information is used for representing the segmentation requirement of the video to be segmented, and the segmentation requirement can comprise, for example, a specific entity in the video to be segmented, an entity with certain characteristics in the video to be segmented, or an entity of a certain category in the video to be segmented.
The segmentation prompt information can be information input by a user, for example, text information input by the user, or entity position information selected in a video to be segmented by the user.
In another possible implementation, the segmentation prompt information may be extracted from other segmented videos. For example, a user may input a video to be segmented and a reference video (a video that has already been segmented) at the same time, and the segmentation prompt information may then be determined from the segmentation result of the reference video.
The present disclosure does not limit the manner in which the segmentation prompt information is obtained.
For step 202,
Because the difference between adjacent video frames in the video to be segmented is small, in order to increase the segmentation speed, the video to be segmented may first be subjected to frame-extraction sampling, and the encoding in step 202 and the subsequent decoding are then performed on the sampled video frames.
It should be noted that, in the present disclosure, the encoding by the image encoder and the prompt information encoder and the decoding by the decoder may be performed by the video segmentation model, and step 202 and the following steps describe operations performed inside the video segmentation model.
In a possible implementation manner, when the image encoder encodes the video frame of the video to be segmented (which may refer to a video frame after sampling and frame extracting here), the following steps may be performed:
step a1, for any video frame, dividing the video frame into a plurality of image blocks.
Step a2, determining an embedded characterization vector of each image block, and determining the embedded characterization vector of the video frame based on the embedded characterization vector of each image block.
And a3, determining the first coding characteristic based on the embedded characterization vector of each video frame.
Specifically, for any video frame, when the video frame is divided into a plurality of image blocks (patches), the division may be performed according to a preset image block size. For example, each image block may be a square with side length P; if the video frame has size H×W, the number of divided image blocks is N = H×W/P^2.
Alternatively, in order to ensure that the number of image blocks after division is a positive integer, the size of the video frame may be limited, for example, a preset size may be set, or the video frame may be processed into a preset size, where the preset size is an integral multiple of the area of the image blocks.
A video frame x ∈ R^(H×W×C), where H×W denotes the size of the video frame and C denotes the number of channels (C = 3 for a video frame). After the video frame is divided, each image block has size P×P×C. When determining the embedded characterization vector of each image block, each image block may be mapped into a D-dimensional embedded characterization vector (embedding) through the linear mapping layer of the image encoder.
After determining the embedded characterization vector of each image block, the embedded characterization vectors of the image blocks may be combined to obtain the embedded characterization vector of the video frame, x_p ∈ R^(N×D), where N denotes the number of image blocks.
When determining the first coding feature based on the embedded characterization vector of each video frame, the embedded characterization vector of each video frame may be spliced according to the position sequence of each video frame in the video to be segmented, so as to obtain the first coding feature.
In addition, in order to distinguish the embedded characterization vectors of the video frames, a learnable flag bit vector x_class may be concatenated with the embedded characterization vector of each video frame; that is, when determining the first coding feature, the combined vector of the flag bit vector and the embedded characterization vector, x_0 = [x_class; x_p] ∈ R^((N+1)×D), is used. If the number of video frames is T, the dimension of the first coding feature is z_0 ∈ R^(T×(N+1)×D).
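As an illustration of steps a1 to a3, a minimal PyTorch-style sketch of the patch embedding described above is given below. The patch size, embedding dimension, and the use of a strided convolution as the linear mapping layer are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Divide each video frame into P x P image blocks and map each block to a D-dimensional embedding."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        # A convolution with stride equal to the kernel size realizes the linear mapping
        # of non-overlapping image blocks (a common implementation choice, assumed here).
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable flag bit vector x_class, concatenated in front of the patch embeddings.
        self.x_class = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, frames):
        # frames: (T, C, H, W); H and W are assumed to be integer multiples of the patch size
        T = frames.shape[0]
        x_p = self.proj(frames).flatten(2).transpose(1, 2)   # (T, N, D), N = H*W / P^2
        x_class = self.x_class.expand(T, -1, -1)             # (T, 1, D)
        return torch.cat([x_class, x_p], dim=1)              # first coding feature: (T, N+1, D)
```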
The encoding process of the prompt information encoder for the segmentation prompt information is similar to that in the SAM model and will not be described here.
The image encoder is consistent with the encoding process and parameters of the image encoder in the SAM model, and the prompt information encoder is consistent with the encoding process and parameters of the prompt information encoder in the SAM model; in the present disclosure, the video segmentation model can be regarded as a fine-tuned version of the SAM model.
For step 203,
The decoder may be a unit for performing temporal feature extraction and spatial feature extraction, and in a possible implementation, the decoder may include a temporal feature extraction module and a spatial feature extraction module for performing temporal feature extraction and spatial feature extraction, respectively.
Optionally, after the first coding feature and the second coding feature are input into the decoder, the first coding feature and the second coding feature may be spliced (for example, through a concat operation) to obtain a third coding feature, and the third coding feature is then input into the temporal feature extraction module and the spatial feature extraction module of the decoder for feature extraction.
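A minimal sketch of the splicing described above is shown below; the shape of the prompt-token feature and the choice to broadcast it over the frame dimension before concatenation are illustrative assumptions.

```python
import torch

def splice_features(first_coding_feature, second_coding_feature):
    """Concatenate the image-encoder and prompt-encoder outputs into the third coding feature."""
    # first_coding_feature: (T, N+1, D); second_coding_feature: (K, D) prompt tokens (shape assumed)
    T = first_coding_feature.shape[0]
    prompt = second_coding_feature.unsqueeze(0).expand(T, -1, -1)   # broadcast prompt tokens to every frame
    return torch.cat([first_coding_feature, prompt], dim=1)         # third coding feature: (T, N+1+K, D)
```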
After inputting the third encoding feature to the decoder, operations mainly include:
1. Spatial feature extraction.
In the process of extracting the spatial features, the association relation among entities in the video frame needs to be considered, so that more aspects of feature information can be acquired through a multi-head attention mechanism to extract the spatial features.
Optionally, the third encoded feature may be input into an attention mechanism model comprising a fine-tuning structural layer, resulting in the second feature.
Specifically, after inputting the third coding feature into the attention mechanism model including the fine-tuning structural layer, the following operations may be performed to obtain the second feature, including:
step b1, carrying out normalization processing on the third coding feature based on a first normalization layer to obtain a first normalization feature;
Step b2, inputting the first normalization feature into the fine adjustment structure layer to obtain a fine adjustment feature; inputting the first normalized feature to a multi-head self-attention module for feature extraction to obtain an intermediate feature;
step b3, fusing the first normalization feature, the intermediate feature and the fine tuning feature to obtain a first fusion feature;
step b4, carrying out normalization processing on the first fusion characteristic based on a second normalization layer to obtain a second normalization characteristic;
and b5, inputting the second normalization feature into a multi-layer perceptron to obtain the second feature.
The first normalization layer and the second normalization layer may be LayerNorm layers, and parameter values of the first normalization layer and the second normalization layer may be different.
The fine-tuning structure layer may be, for example, a Low-Rank Adaptation (LoRA) layer, whose internal structure is shown in fig. 3. The LoRA layer includes two fully-connected layers and a nonlinear activation layer (GELU): the first fully-connected layer (FC Down) is a linear layer with parameter dimension M1×M2, whose purpose is parameter down-sampling to reduce the number of model channels; the second fully-connected layer (FC Up) is also a linear layer, with parameter dimension M2×M1, whose purpose is parameter up-sampling to restore the number of model channels.
In the parameter dimensions above, M1 > M2. The purpose is that, because M1 > M2, enough information can still be captured after the down-sampling even though M2 is small, so that multiple low-rank weight matrices are more likely to be accommodated than a single type of weight with a larger rank.
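A minimal sketch of the fine-tuning structure layer described above, following the FC Down -> GELU -> FC Up structure of fig. 3; the concrete values of M1 and M2 are illustrative assumptions.

```python
import torch.nn as nn

class LoRALayer(nn.Module):
    """Fine-tuning structure layer: down-projection, GELU, up-projection, with M1 > M2."""
    def __init__(self, m1=768, m2=32):
        super().__init__()
        self.fc_down = nn.Linear(m1, m2)   # FC Down: parameter down-sampling, reduces the channel number
        self.act = nn.GELU()               # nonlinear activation layer
        self.fc_up = nn.Linear(m2, m1)     # FC Up: parameter up-sampling, restores the channel number

    def forward(self, x):
        return self.fc_up(self.act(self.fc_down(x)))
```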
The internal structure of the spatial feature extraction module is shown in fig. 4a; it includes a fine-tuning structure layer (LoRA), which is placed beside the multi-head self-attention (MSA) module and whose output is fused with the output of the multi-head self-attention module.
The overall flow is as follows: the third coding feature is input into the first normalization layer (LayerNorm) for normalization to obtain the first normalization feature; the first normalization feature is input into the LoRA layer and the MSA; the outputs of the LoRA and the MSA are fused with the third coding feature; the fused feature is input into the second normalization layer (LayerNorm); the output of the second normalization layer is input into a multi-layer perceptron (MLP, Multilayer Perceptron); and the output of the MLP is fused with the feature input into the second normalization layer to obtain the second feature.
For comparison, fig. 4b shows the internal structure of the decoder of the SAM model; as can be seen by comparing it with fig. 4a, the spatial feature extraction module in the present disclosure adds the LoRA layer on the basis of the decoder of the SAM model.
Here, the LoRA layer is placed directly beside the MSA without a skip connection; the purpose is to keep the final video segmentation model close to the SAM model so as to preserve model performance.
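Putting the above together, the sketch below approximates the spatial feature extraction module of fig. 4a (reusing the LoRALayer sketch above). The hidden dimension, number of heads, and MLP ratio are assumptions; the residual fusion here adds the LoRA and MSA outputs to the third coding feature, following the flow described above (steps b1 to b5 fuse the first normalization feature instead, and either reading fits this sketch).

```python
import torch.nn as nn

class SpatialFeatureBlock(nn.Module):
    """LayerNorm -> (MSA + LoRA side branch) -> fusion -> LayerNorm -> MLP -> fusion."""
    def __init__(self, dim=768, num_heads=8, lora_dim=32, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                                     # first normalization layer
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lora = LoRALayer(dim, lora_dim)                               # fine-tuning structure layer beside the MSA
        self.norm2 = nn.LayerNorm(dim)                                     # second normalization layer
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))          # multi-layer perceptron

    def forward(self, x):                            # x: third coding feature, (T, tokens, D)
        h = self.norm1(x)                            # first normalization feature
        attn, _ = self.msa(h, h, h)                  # intermediate feature from multi-head self-attention
        fused = x + attn + self.lora(h)              # first fusion feature
        return fused + self.mlp(self.norm2(fused))   # second feature
```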
2. Temporal feature extraction.
In performing temporal feature extraction, it is necessary to additionally consider the association relationship between frames. By way of example, the temporal feature extraction may be performed by:
and step c1, carrying out first channel adjustment on the third coding feature to obtain an adjustment feature.
And c2, inputting the adjustment feature into an attention mechanism model comprising a fine adjustment structure layer to obtain a third feature.
And c3, performing second channel adjustment on the third characteristic to obtain the first characteristic.
The channel adjustment may be implemented by a reshape operation, for example. The third coding feature is z_0 ∈ R^(T×(N+1)×D); after the first channel adjustment, the adjustment feature may be z'_0 ∈ R^((N+1)×T×D).
Here, if the third coding feature were directly input to the MSA and the LoRA (structured the same as in the spatial feature extraction module described above) without the first channel adjustment, the subsequent MSA and other operations would be performed along the (N+1) dimension, that is, the image block dimension. After the first channel adjustment, the dimensions of the adjustment feature change, and the subsequent MSA and other operations are performed along the T dimension, where T denotes the number of video frames, so the relations among the T video frames can be learned in this way.
After the adjustment feature is input into the attention mechanism model including the fine-tuning structure layer to obtain the third feature, the dimension of the third feature is the same as that of the adjustment feature. Since the first feature and the second feature need to be fused, in order to keep the feature dimensions consistent, a second channel adjustment needs to be performed on the third feature, so that the feature dimension of the adjusted feature, i.e., the first feature, is the same as that of the second feature.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a temporal feature extraction module according to an embodiment of the present disclosure, including a reshape operation for performing a first channel adjustment and a reshape operation for performing a second channel adjustment, and the rest of an attention mechanism model including a fine tuning structure layer is the same as the model structures described in fig. 4a and 4b, which will not be described herein.
It should be noted that, although the model structures of the attention mechanism models including the fine adjustment structure layer of the temporal feature extraction module and the spatial feature extraction module are the same, specific model parameter values may be different.
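A sketch of the temporal feature extraction module of fig. 5 is given below, reusing the SpatialFeatureBlock sketch above as the attention mechanism model with a fine-tuning structure layer (with its own, separately trained parameter values); implementing the reshape-based channel adjustments with permute is an assumption about the exact operation.

```python
import torch.nn as nn

class TemporalFeatureBlock(nn.Module):
    """First channel adjustment -> attention over the T frame dimension -> second channel adjustment."""
    def __init__(self, dim=768, num_heads=8, lora_dim=32):
        super().__init__()
        # Same structure as the spatial module, but separate parameter values.
        self.attn_block = SpatialFeatureBlock(dim, num_heads, lora_dim)

    def forward(self, x):                           # x: third coding feature, (T, N+1, D)
        z = x.permute(1, 0, 2).contiguous()         # first channel adjustment: (N+1, T, D)
        z = self.attn_block(z)                      # attention now mixes information across the T frames
        return z.permute(1, 0, 2).contiguous()      # second channel adjustment back to (T, N+1, D): first feature
```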
For step 204,
In a possible implementation manner, when determining the segmentation mask image corresponding to the video frame based on the first feature and the second feature, the first feature and the second feature may be fused to obtain a second fused feature; and then determining a segmentation mask image corresponding to the video frame based on the second fusion characteristic.
For example, when determining the segmentation mask image corresponding to the video frame based on the second fusion feature, the second fusion feature may be input into a multi-layer perceptron to determine the segmentation mask image.
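A sketch of the mask determination step: the first (temporal) and second (spatial) features are fused and fed to a multi-layer perceptron to produce a per-frame binary mask. Fusing by addition, dropping the non-patch tokens, and unfolding patch logits back to pixel resolution are all illustrative assumptions.

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """Fuse the two decoder branches and predict a segmentation mask image per frame."""
    def __init__(self, dim=768, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, patch_size * patch_size))

    def forward(self, first_feature, second_feature, grid_hw):
        gh, gw = grid_hw                                   # patch grid of the frame: H/P x W/P
        fused = first_feature + second_feature             # second fusion feature (addition as one simple choice)
        tokens = fused[:, 1:1 + gh * gw, :]                # keep only the N patch tokens
        logits = self.mlp(tokens)                          # (T, N, P*P) per-patch mask logits
        T = logits.shape[0]
        logits = logits.view(T, gh, gw, self.patch_size, self.patch_size)
        logits = logits.permute(0, 1, 3, 2, 4).reshape(T, gh * self.patch_size, gw * self.patch_size)
        return (logits.sigmoid() > 0.5).float()            # binary segmentation mask image per frame
```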
Here, if the video frames are the video frames obtained after frame-extraction sampling, the number of segmentation mask images is the same as the number of sampled video frames.
If the video frames are the video frames obtained after frame-extraction sampling, and the video segmentation task is to segment every frame of the video to be segmented, the segmentation mask images of the remaining, non-extracted video frames also need to be determined from these segmentation mask images.
Denote the extracted video frames as first video frames and the non-extracted video frames as second video frames; the segmentation mask images corresponding to the first video frames are determined through steps 201 to 204, the first video frames and the second video frames together make up the video to be segmented, and the segmentation mask image of a second video frame can be determined from the segmentation mask image of a first video frame, so that the second video frame is segmented based on its own segmentation mask image.
The segmentation mask image may be a binary image: a pixel in the video frame whose position has the value 1 in the segmentation mask image is a pixel that needs to be segmented out of the video frame, and a pixel whose position has the value 0 is a pixel that does not need to be segmented.
Since adjacent video frames differ little from one another, the segmentation mask image of a second video frame can be determined from the difference information between video frames.
For example, if the a-th video frame is a first video frame, then when determining the segmentation mask image of the (a+k)-th video frame (a second video frame), pixel displacement information of the entity to be segmented between the a-th video frame and the (a+k)-th video frame may be determined, and the pixel positions with the value 1 in the mask of the a-th video frame may then be shifted based on the pixel displacement information; the adjusted segmentation mask image is the segmentation mask image of the (a+k)-th video frame.
For any second video frame, when determining its corresponding segmentation mask image, the first video frame closest to it may be determined first, and the segmentation mask image corresponding to the second video frame may then be determined based on the segmentation mask image of that first video frame.
If there are two first video frames equally close to the second video frame, the similarity between the second video frame and each of the two first video frames can be calculated, and the segmentation mask image of the second video frame is determined based on the segmentation mask image of the first video frame with the higher similarity.
Alternatively, for any second video frame, the first video frame with the highest similarity to it can be determined directly, and the segmentation mask image corresponding to the second video frame is then determined based on the segmentation mask image of that first video frame.
In another possible implementation, in determining the segmentation mask image of the second video frame, optical flow information between the second video frame and the first video frame may be calculated, and then the segmentation mask image of the first video frame is adjusted based on the optical flow information to determine the segmentation mask image of the second video frame.
Here, the first video frame participating in the optical flow calculation may be a first video frame adjacent to the second video frame; although optical flow information is calculated, the complexity of the optical flow calculation here is low because only a small number of video frames participate in it.
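The sketch below illustrates the nearest-frame propagation described above for the second (non-extracted) video frames; frame similarity is measured here with a simple mean-squared difference, and the pixel-displacement or optical-flow adjustment mentioned above is omitted, so copying the mask unchanged is a simplification.

```python
import torch
import torch.nn.functional as F

def propagate_masks(first_frame_indices, first_frame_masks, all_frames):
    """Assign each second video frame the mask of its closest / most similar first video frame.

    first_frame_indices: indices of the extracted (first) video frames
    first_frame_masks: dict {frame index: binary mask tensor (H, W)} from steps 201-204
    all_frames: tensor (T, C, H, W) holding every frame of the video to be segmented
    """
    masks = dict(first_frame_masks)
    for t in range(all_frames.shape[0]):
        if t in masks:
            continue
        # candidate first frames nearest in time; break ties by pixel similarity
        nearest = sorted(first_frame_indices, key=lambda s: abs(s - t))[:2]
        best = min(nearest, key=lambda s: F.mse_loss(all_frames[s], all_frames[t]).item())
        masks[t] = first_frame_masks[best]
    return masks
```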
The overall architecture of the video segmentation model is described below with reference to the drawings. Referring to fig. 6, the overall architecture of the video segmentation model provided by an embodiment of the present disclosure mainly includes an image encoder, a prompt information encoder, and a decoder. The image encoder encodes the image (i.e., the video frame) and feeds the result to the decoder, and the prompt information encoder encodes the segmentation prompt information and feeds the result to the decoder. The decoder mainly includes two parts: a spatial feature extraction module and a temporal feature extraction module, which perform spatial feature extraction and temporal feature extraction on the input feature, respectively; after the outputs of the two branches are fused, they are used to determine the MASK, i.e., the segmentation mask image.
When the video segmentation model is trained, the training can be regarded as a fine-tuning process that requires a relatively small amount of data; the specific training process may include the following steps:
and d 1, acquiring a sample video, segmentation prompt information and an annotation mask image corresponding to the sample video.
The annotation mask image may be labeled manually or in another manner, and is used to represent a segmentation result that is consistent with the segmentation prompt information.
And d2, inputting the sample video and the segmentation prompt information into a video segmentation model to be trained, and determining a segmentation mask image predicted by the video segmentation model.
And d3, determining a loss value of the training based on the predicted segmentation mask image and the annotation mask image, and adjusting the parameter value of the video segmentation model to be trained based on the loss value.
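A minimal sketch of the fine-tuning procedure in steps d1 to d3. Freezing the pretrained SAM weights and training only the added LoRA/temporal parameters, and using a binary cross-entropy loss between the predicted and annotation mask images, are assumptions; the disclosure only states that a loss value is computed from the two masks and used to adjust the model parameters.

```python
import torch
import torch.nn as nn

def fine_tune(video_seg_model, dataloader, epochs=10, lr=1e-4):
    """Fine-tune the video segmentation model on (sample video, prompt, annotation mask) triples."""
    # Train only the newly added parameters; keep the pretrained SAM weights frozen (assumption).
    trainable = [p for name, p in video_seg_model.named_parameters()
                 if "lora" in name or "temporal" in name]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for sample_video, prompt, annotation_mask in dataloader:
            pred_mask_logits = video_seg_model(sample_video, prompt)   # predicted segmentation mask (logits)
            loss = criterion(pred_mask_logits, annotation_mask)        # loss value of this training step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```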
When the video to be segmented is segmented, after the video frames and the segmentation prompt information are encoded, feature extraction can be performed from the spatial domain and the temporal domain respectively; a segmentation mask image corresponding to each video frame is then determined based on the first feature obtained by temporal feature extraction and the second feature obtained by spatial feature extraction, and segmentation is performed based on the segmentation mask image. In this way, the association among the video frames in the temporal domain is taken into account during feature extraction, so the segmentation mask image obtained in this way yields a more accurate segmentation result when the video frames are segmented.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Based on the same inventive concept, the embodiments of the present disclosure further provide a video segmentation apparatus corresponding to the video segmentation method, and since the principle of solving the problem by the apparatus in the embodiments of the present disclosure is similar to that of the video segmentation method described in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 7, an architecture diagram of a video segmentation apparatus according to an embodiment of the disclosure is provided, where the apparatus includes: an acquisition module 701, an encoding module 702, a decoding module 703 and a determination module 704; wherein:
the acquisition module 701 is configured to acquire a video to be segmented and segmentation prompt information corresponding to the video to be segmented; the segmentation prompt information is used for representing the segmentation requirement of the video to be segmented;
the encoding module 702 is configured to encode the video frame of the video to be segmented based on an image encoder, so as to obtain a first encoding feature; and encoding the segmentation prompt information based on a prompt information encoder to obtain a second encoding characteristic;
a decoding module 703, configured to input the first encoded feature and the second encoded feature into a decoder, perform temporal feature extraction to obtain a first feature, and perform spatial feature extraction to obtain a second feature;
A determining module 704, configured to determine a segmentation mask image corresponding to the video frame based on the first feature and the second feature, and perform segmentation processing on the video frame based on the segmentation mask image.
In a possible implementation manner, the video frame of the video to be segmented is a video frame obtained after sampling and frame extracting; the encoding module 702 is configured to, when encoding the video frame of the video to be segmented based on an image encoder, obtain a first encoding feature:
for any video frame, dividing the video frame into a plurality of image blocks;
determining an embedded characterization vector of each image block, and determining the embedded characterization vector of the video frame based on the embedded characterization vector of each image block;
the first encoding feature is determined based on the embedded characterization vector for each video frame.
In a possible implementation, the decoding module 703 is configured to, when inputting the first coding feature and the second coding feature into a decoder:
splicing the first coding feature and the second coding feature to obtain a third coding feature;
the third encoding feature is input into the decoder.
In a possible implementation manner, the decoding module 703 is configured to perform the spatial feature extraction by:
and inputting the third coding feature into an attention mechanism model containing a fine-tuning structural layer to obtain the second feature.
In a possible implementation manner, the decoding module 703 is configured to, when inputting the third coding feature into an attention mechanism model including a fine-tuning structural layer, obtain the second feature:
normalizing the third coding feature based on a first normalization layer to obtain a first normalization feature;
inputting the first normalization feature to the fine adjustment structure layer to obtain a fine adjustment feature; inputting the first normalized feature to a multi-head self-attention module for feature extraction to obtain an intermediate feature;
fusing the first normalization feature, the intermediate feature and the fine tuning feature to obtain a first fusion feature;
normalizing the first fusion feature based on a second normalization layer to obtain a second normalization feature;
and inputting the second normalization feature into a multi-layer perceptron to obtain the second feature.
In a possible implementation manner, the decoding module 703 is configured to perform the temporal feature extraction by:
Performing first channel adjustment on the third coding feature to obtain an adjustment feature;
inputting the adjustment feature into an attention mechanism model comprising a fine adjustment structure layer to obtain a third feature;
and performing second channel adjustment on the third characteristic to obtain the first characteristic.
In a possible implementation manner, the determining module 704 is configured to, when determining the segmentation mask image corresponding to the video frame based on the first feature and the second feature:
fusing the first feature and the second feature to obtain a second fused feature;
and determining a segmentation mask image corresponding to the video frame based on the second fusion feature.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
Based on the same technical concept, the embodiment of the disclosure also provides computer equipment. Referring to fig. 8, a schematic diagram of a computer device 800 according to an embodiment of the disclosure includes a processor 801, a memory 802, and a bus 803. The memory 802 is used for storing execution instructions, including a memory 8021 and an external memory 8022; the memory 8021 is also referred to as an internal memory, and is used for temporarily storing operation data in the processor 801 and data exchanged with an external memory 8022 such as a hard disk, and the processor 801 exchanges data with the external memory 8022 through the memory 8021, and when the computer device 800 operates, the processor 801 and the memory 802 communicate with each other through the bus 803, so that the processor 801 executes the following instructions:
Acquiring a video to be segmented and segmentation prompt information corresponding to the video to be segmented; the segmentation prompt information is used for representing the segmentation requirement of the video to be segmented;
encoding the video frames of the video to be segmented based on an image encoder to obtain first encoding features; and encoding the segmentation prompt information based on a prompt information encoder to obtain a second encoding characteristic;
inputting the first coding feature and the second coding feature into a decoder, and performing temporal feature extraction to obtain a first feature and spatial feature extraction to obtain a second feature;
and determining a segmentation mask image corresponding to the video frame based on the first feature and the second feature, and carrying out segmentation processing on the video frame based on the segmentation mask image.
In a possible implementation manner, in the instructions executed by the processor 801, the video frame of the video to be segmented is a video frame obtained after sampling and frame extracting; the image-based encoder encodes the video frames of the video to be segmented to obtain first encoding features, including:
for any video frame, dividing the video frame into a plurality of image blocks;
Determining an embedded characterization vector of each image block, and determining the embedded characterization vector of the video frame based on the embedded characterization vector of each image block;
the first encoding feature is determined based on the embedded characterization vector for each video frame.
In a possible implementation manner, in the instructions executed by the processor 801, the inputting the first coding feature and the second coding feature into a decoder includes:
splicing the first coding feature and the second coding feature to obtain a third coding feature;
the third encoding feature is input into the decoder.
In a possible implementation manner, the decoder is configured to perform the spatial feature extraction by using the following method in an instruction executed by the processor 801:
and inputting the third coding feature into an attention mechanism model containing a fine-tuning structural layer to obtain the second feature.
In a possible implementation manner, in an instruction executed by the processor 801, the inputting the third coding feature into an attention mechanism model including a fine-tuning structural layer, to obtain the second feature includes:
normalizing the third coding feature based on a first normalization layer to obtain a first normalization feature;
Inputting the first normalization feature to the fine adjustment structure layer to obtain a fine adjustment feature; inputting the first normalized feature to a multi-head self-attention module for feature extraction to obtain an intermediate feature;
fusing the first normalization feature, the intermediate feature and the fine tuning feature to obtain a first fusion feature;
normalizing the first fusion feature based on a second normalization layer to obtain a second normalization feature;
and inputting the second normalization feature into a multi-layer perceptron to obtain the second feature.
In a possible implementation manner, the decoder is configured to perform the time feature extraction by using the following method in an instruction executed by the processor 801:
performing first channel adjustment on the third coding feature to obtain an adjustment feature;
inputting the adjustment feature into an attention mechanism model comprising a fine adjustment structure layer to obtain a third feature;
and performing second channel adjustment on the third characteristic to obtain the first characteristic.
In a possible implementation manner, in the instructions executed by the processor 801, the determining, based on the first feature and the second feature, a segmentation mask image corresponding to the video frame includes:
Fusing the first feature and the second feature to obtain a second fused feature;
and determining a segmentation mask image corresponding to the video frame based on the second fusion feature.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the video segmentation method described in the method embodiments above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiments of the present disclosure further provide a computer program product, where the computer program product carries program code, where instructions included in the program code may be used to perform the steps of the video segmentation method described in the foregoing method embodiments, and specifically reference may be made to the foregoing method embodiments, which are not described herein.
Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or a part of the technical solution, or in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the foregoing examples are merely specific embodiments of the present disclosure, and are not intended to limit the scope of the disclosure, but the present disclosure is not limited thereto, and those skilled in the art will appreciate that while the foregoing examples are described in detail, it is not limited to the disclosure: any person skilled in the art, within the technical scope of the disclosure of the present disclosure, may modify or easily conceive changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features thereof; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included within the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method of video segmentation, comprising:
acquiring a video to be segmented and segmentation prompt information corresponding to the video to be segmented; the segmentation prompt information is used for representing the segmentation requirement of the video to be segmented;
encoding video frames of the video to be segmented based on an image encoder to obtain a first encoding feature, and encoding the segmentation prompt information based on a prompt information encoder to obtain a second encoding feature;
inputting the first encoding feature and the second encoding feature into a decoder, and performing temporal feature extraction to obtain a first feature and spatial feature extraction to obtain a second feature;
and determining a segmentation mask image corresponding to the video frame based on the first feature and the second feature, and carrying out segmentation processing on the video frame based on the segmentation mask image.
2. The method according to claim 1, wherein the video frames of the video to be segmented are video frames obtained by sampling and frame extraction, and the encoding video frames of the video to be segmented based on an image encoder to obtain a first encoding feature comprises:
for any video frame, dividing the video frame into a plurality of image blocks;
determining an embedded characterization vector of each image block, and determining an embedded characterization vector of the video frame based on the embedded characterization vectors of the image blocks;
and determining the first encoding feature based on the embedded characterization vector of each video frame.
3. The method according to claim 1, wherein the inputting the first encoding feature and the second encoding feature into a decoder comprises:
concatenating the first encoding feature and the second encoding feature to obtain a third encoding feature;
and inputting the third encoding feature into the decoder.
4. The method according to claim 3, wherein the decoder is adapted to perform the spatial feature extraction by:
inputting the third encoding feature into an attention mechanism model containing a fine-tuning structural layer to obtain the second feature.
5. The method according to claim 4, wherein the inputting the third encoding feature into an attention mechanism model containing a fine-tuning structural layer to obtain the second feature comprises:
normalizing the third encoding feature based on a first normalization layer to obtain a first normalized feature;
inputting the first normalized feature into the fine-tuning structural layer to obtain a fine-tuning feature, and inputting the first normalized feature into a multi-head self-attention module for feature extraction to obtain an intermediate feature;
fusing the first normalized feature, the intermediate feature, and the fine-tuning feature to obtain a first fused feature;
normalizing the first fused feature based on a second normalization layer to obtain a second normalized feature;
and inputting the second normalized feature into a multi-layer perceptron to obtain the second feature.
6. The method according to claim 3, wherein the decoder is adapted to perform the temporal feature extraction by:
performing a first channel adjustment on the third encoding feature to obtain an adjustment feature;
inputting the adjustment feature into an attention mechanism model containing a fine-tuning structural layer to obtain a third feature;
and performing a second channel adjustment on the third feature to obtain the first feature.
7. The method of claim 1, wherein the determining a segmentation mask image corresponding to the video frame based on the first feature and the second feature comprises:
fusing the first feature and the second feature to obtain a second fused feature;
and determining a segmentation mask image corresponding to the video frame based on the second fused feature.
8. A video segmentation apparatus, comprising:
an acquisition module, configured to acquire a video to be segmented and segmentation prompt information corresponding to the video to be segmented, wherein the segmentation prompt information is used for representing a segmentation requirement of the video to be segmented;
an encoding module, configured to encode video frames of the video to be segmented based on an image encoder to obtain a first encoding feature, and to encode the segmentation prompt information based on a prompt information encoder to obtain a second encoding feature;
a decoding module, configured to input the first encoding feature and the second encoding feature into a decoder, and to perform temporal feature extraction to obtain a first feature and spatial feature extraction to obtain a second feature;
and a determining module, configured to determine a segmentation mask image corresponding to the video frame based on the first feature and the second feature, and to perform segmentation processing on the video frame based on the segmentation mask image.
9. A computer device, comprising: a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor, the processor communicates with the memory via the bus when the computer device runs, and the machine-readable instructions, when executed by the processor, perform the steps of the video segmentation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the video segmentation method according to any one of claims 1 to 7.
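
The claims above recite a prompt-conditioned video segmentation pipeline. The Python (PyTorch) sketches that follow are illustrative only and are not taken from the patent: all module names, tensor shapes, layer widths, and fusion choices are assumptions made for readability. This first sketch mirrors the overall flow of claim 1: an image encoder produces the first encoding feature, a prompt encoder produces the second encoding feature, a decoder extracts temporal and spatial features, and a mask head predicts per-frame segmentation masks.

# Illustrative skeleton of the claim 1 pipeline; all module interfaces and
# shapes are assumptions, not the patent's concrete design.
import torch
import torch.nn as nn


class PromptedVideoSegmenter(nn.Module):
    def __init__(self, image_encoder: nn.Module, prompt_encoder: nn.Module,
                 decoder: nn.Module, mask_head: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # video frames -> first encoding feature
        self.prompt_encoder = prompt_encoder  # segmentation prompt -> second encoding feature
        self.decoder = decoder                # temporal + spatial feature extraction
        self.mask_head = mask_head            # fused features -> segmentation mask images

    def forward(self, frames: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) sampled video frames; prompt: encoded prompt input.
        first_enc = self.image_encoder(frames)
        second_enc = self.prompt_encoder(prompt)
        first_feat, second_feat = self.decoder(first_enc, second_enc)  # temporal, spatial
        return self.mask_head(first_feat, second_feat)  # per-frame masks used for segmentation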
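
Claim 2 splits each sampled frame into image blocks and assembles block-level embedded characterization vectors into the first encoding feature. A minimal sketch of one such patch-embedding step follows; the 16-pixel patch size and 256-dimensional embedding are arbitrary assumptions.

# Illustrative per-frame patch embedding in the spirit of claim 2.
import torch
import torch.nn as nn


class FramePatchEmbedding(nn.Module):
    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 256):
        super().__init__()
        # A strided convolution both divides each frame into patch_size x patch_size
        # image blocks and projects every block to an embedding vector.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -> first encoding feature: (B, T, N, D),
        # where N is the number of image blocks per frame and D the embedding dim.
        b, t, c, h, w = frames.shape
        x = self.proj(frames.flatten(0, 1))           # (B*T, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)               # (B*T, N, D) block embedding vectors
        return x.view(b, t, x.shape[1], x.shape[2])    # per-frame embeddings stacked over time


# Usage on dummy data (shapes are arbitrary assumptions):
frames = torch.randn(1, 4, 3, 224, 224)
first_encoding_feature = FramePatchEmbedding()(frames)  # (1, 4, 196, 256)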
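
Claims 3 to 5 concatenate the two encoding features and pass the result through an attention block that carries a fine-tuning structural layer alongside multi-head self-attention. The sketch below assumes the fine-tuning layer is a bottleneck adapter and that the three intermediate features are fused by summation; neither assumption is stated in the claims.

# Illustrative spatial branch of claims 4-5 with an assumed adapter-style fine-tuning layer.
import torch
import torch.nn as nn


class FineTuningAdapter(nn.Module):
    # Assumed bottleneck adapter standing in for the "fine-tuning structural layer".
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class SpatialAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                  # first normalization layer
        self.adapter = FineTuningAdapter(dim)           # fine-tuning structural layer
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                  # second normalization layer
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, third_feat: torch.Tensor) -> torch.Tensor:
        # third_feat: (B, N, D) concatenation of the first and second encoding features.
        normed = self.norm1(third_feat)                  # first normalized feature
        tuned = self.adapter(normed)                     # fine-tuning feature
        attended, _ = self.attn(normed, normed, normed)  # intermediate feature
        fused = normed + attended + tuned                # first fused feature (summation is an assumption)
        return self.mlp(self.norm2(fused))               # second feature


# Usage on dummy data:
second_feature = SpatialAttentionBlock()(torch.randn(2, 200, 256))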
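
Claim 6 frames the temporal branch as a first channel adjustment, attention with a fine-tuning layer, and a second channel adjustment. One plausible reading, assumed here, is that the channel adjustments reshape the token tensor so attention runs along the time axis; a plain Transformer encoder layer stands in for the adapter-equipped attention model to keep the sketch self-contained.

# Illustrative temporal branch of claim 6 under an assumed reshape-based channel adjustment.
import torch
import torch.nn as nn


class TemporalAttentionBranch(nn.Module):
    def __init__(self, attn_block: nn.Module):
        super().__init__()
        # attn_block stands in for the attention mechanism model containing the
        # fine-tuning structural layer; any (B, L, D) -> (B, L, D) module works here.
        self.attn_block = attn_block

    def forward(self, third_feat: torch.Tensor) -> torch.Tensor:
        # third_feat: (B, T, N, D). First channel adjustment: fold the spatial token
        # axis into the batch so attention runs along the time axis.
        b, t, n, d = third_feat.shape
        adjusted = third_feat.permute(0, 2, 1, 3).reshape(b * n, t, d)  # adjustment feature
        third = self.attn_block(adjusted)                               # third feature
        # Second channel adjustment: restore (B, T, N, D) to give the first feature.
        return third.reshape(b, n, t, d).permute(0, 2, 1, 3)


# Usage with a stand-in attention block (a plain Transformer encoder layer):
branch = TemporalAttentionBranch(nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True))
first_feature = branch(torch.randn(1, 4, 196, 256))  # (1, 4, 196, 256)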
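
Claim 7 fuses the first (temporal) and second (spatial) features and decodes the result into a segmentation mask image per frame. The sketch below assumes elementwise addition for fusion, a square token grid, and a 1x1 convolution followed by bilinear upsampling for mask prediction; the patent does not specify these choices.

# Illustrative fusion and mask prediction in the spirit of claim 7.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskPredictionHead(nn.Module):
    def __init__(self, dim: int = 256, grid: int = 14, out_size: int = 224):
        super().__init__()
        self.grid = grid
        self.out_size = out_size
        self.to_mask = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, first_feat: torch.Tensor, second_feat: torch.Tensor) -> torch.Tensor:
        # first_feat, second_feat: (B, T, N, D) with N = grid * grid tokens per frame.
        fused = first_feat + second_feat                          # second fused feature (sum is an assumption)
        b, t, n, d = fused.shape
        x = fused.reshape(b * t, self.grid, self.grid, d).permute(0, 3, 1, 2)
        logits = self.to_mask(x)                                  # (B*T, 1, grid, grid)
        masks = F.interpolate(logits, size=(self.out_size, self.out_size),
                              mode="bilinear", align_corners=False)
        return masks.view(b, t, 1, self.out_size, self.out_size)  # per-frame segmentation mask images


# Usage on dummy features:
masks = MaskPredictionHead()(torch.randn(1, 4, 196, 256), torch.randn(1, 4, 196, 256))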
CN202310773090.7A 2023-06-27 2023-06-27 Video segmentation method, device, computer equipment and storage medium Pending CN116797975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310773090.7A CN116797975A (en) 2023-06-27 2023-06-27 Video segmentation method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310773090.7A CN116797975A (en) 2023-06-27 2023-06-27 Video segmentation method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116797975A true CN116797975A (en) 2023-09-22

Family

ID=88036051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310773090.7A Pending CN116797975A (en) 2023-06-27 2023-06-27 Video segmentation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116797975A (en)

Similar Documents

Publication Publication Date Title
US11922671B2 (en) Apparatus and method for processing image data
CN111670580B (en) Progressive compressed domain computer vision and deep learning system
CN111815509B (en) Image style conversion and model training method and device
CN112818955B (en) Image segmentation method, device, computer equipment and storage medium
CN114238904B (en) Identity recognition method, and training method and device of dual-channel hyper-resolution model
US20210150287A1 (en) Apparatus and method of using ai metadata related to image quality
WO2023005740A1 (en) Image encoding, decoding, reconstruction, and analysis methods, system, and electronic device
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
US20230100615A1 (en) Video processing method and apparatus, and device, decoder, system and storage medium
CN112235569B (en) Quick video classification method, system and device based on H264 compressed domain
CN113379858A (en) Image compression method and device based on deep learning
CN115115540A (en) Unsupervised low-light image enhancement method and unsupervised low-light image enhancement device based on illumination information guidance
CN112702607B (en) Intelligent video compression method and device based on optical flow decision
CN116508320A (en) Chroma subsampling format processing method in image decoding based on machine learning
US20200074638A1 (en) Image segmentation method, apparatus and non-transitory computer readable medium of the same
CN113516592A (en) Image processing method, model training method, device and equipment
CN116797975A (en) Video segmentation method, device, computer equipment and storage medium
CN116091765A (en) RGB-T image semantic segmentation method and device
Huang et al. Deep Multimodal Fusion Autoencoder for Saliency Prediction of RGB‐D Images
US12028540B2 (en) Video size reduction by reconstruction
CN116912345B (en) Portrait cartoon processing method, device, equipment and storage medium
CN116821699B (en) Perception model training method and device, electronic equipment and storage medium
US20230412825A1 (en) Video size reduction by reconstruction
Huo et al. Domain adaptive crowd counting via dynamic scale aggregation network
CN113076828B (en) Video editing method and device and model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination