CN114140488A - Video target segmentation method and device and training method of video target segmentation model - Google Patents

Video target segmentation method and device and training method of video target segmentation model

Info

Publication number
CN114140488A
Authority
CN
China
Prior art keywords
image frame
video
target
mask
training
Prior art date
Legal status
Pending
Application number
CN202111440935.8A
Other languages
Chinese (zh)
Inventor
王伟农
戴宇荣
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111440935.8A priority Critical patent/CN114140488A/en
Publication of CN114140488A publication Critical patent/CN114140488A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The disclosure relates to a video target segmentation method and device and a training method of a video target segmentation model. The video object segmentation method comprises the following steps: acquiring a video to be processed; for each image frame in the video to be processed, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame; and performing target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed. The video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.

Description

Video target segmentation method and device and training method of video target segmentation model
Technical Field
The present disclosure relates to the field of video processing, and in particular, to a method and an apparatus for segmenting a video target, and a method for training a video target segmentation model.
Background
Video target segmentation technology has a wide range of uses, broad application fields, and huge market potential; it can be applied to fields such as intelligent short-video editing, special-effect production, and short-video creation. At present, in order to obtain a relatively stable segmentation result using temporal information, deep-learning-based video object segmentation techniques capture temporal context through means such as optical flow or feature correlation. Although relatively good segmentation results are obtained this way, a huge amount of computation is introduced, which reduces the speed of the algorithm used by the model and creates a speed bottleneck for practical deployment, especially for mobile-terminal applications.
Disclosure of Invention
The disclosure provides a video target segmentation method and device and a training method of a video target segmentation model, which at least solve the problems of large computation amount and low convergence rate of deep-learning-based video target segmentation in the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided a video object segmentation method, including: acquiring a video to be processed; for each image frame in the video to be processed, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame; and performing target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed; wherein the video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.
Optionally, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain a target mask of the image frame includes: determining a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame.
Optionally, determining the guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame includes: for an image frame in the at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, taking the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjusting the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determining a second position whose value is less than or equal to a second threshold in the target mask of the image frame, taking the position corresponding to the second position in the initial matrix as a position of guide background information, and adjusting the value at the position of the guide background information to a second preset value; determining a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, taking the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjusting the value at the position of the invalid guide information to a third preset value; and taking the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
Optionally, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: normalizing the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenating the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: inputting the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image object segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenating the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
According to a second aspect of the embodiments of the present disclosure, there is provided a training method for a video object segmentation model, including: acquiring a training sample set, wherein the training sample set comprises a plurality of training videos and an actual target mask of each image frame in each training video; for each image frame in the training video, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain an estimated mask of the image frame; and training the video target segmentation model based on the estimated mask of each image frame in the training video and the actual target mask of each image frame in the training video.
Optionally, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain an estimated mask of the image frame includes: determining a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the estimated mask of the image frame.
Optionally, determining the guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame includes: for an image frame in the at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, taking the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjusting the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determining a second position whose value is less than or equal to a second threshold in the target mask of the image frame, taking the position corresponding to the second position in the initial matrix as a position of guide background information, and adjusting the value at the position of the guide background information to a second preset value; determining a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, taking the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjusting the value at the position of the invalid guide information to a third preset value; and taking the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
Optionally, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: normalizing the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenating the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: inputting the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image object segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenating the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Optionally, before the training, the parameters of the first three channels of the first convolutional layer of the video object segmentation model are set to the parameters of the first convolutional layer of an image object segmentation model, wherein the image object segmentation model is trained in advance.
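A hedged PyTorch sketch of this initialization; it assumes both first layers are nn.Conv2d modules with matching output channels and kernel size, and that the extra guide channels are zero-initialized (the disclosure only specifies copying the first three channels):

```python
import torch
import torch.nn as nn

def init_first_conv(video_conv: nn.Conv2d, image_conv: nn.Conv2d) -> None:
    """Copy the pre-trained image model's first-layer parameters into the
    first three input channels of the video model's first conv layer."""
    with torch.no_grad():
        video_conv.weight.zero_()                     # extra guide channels start at zero
        video_conv.weight[:, :3] = image_conv.weight  # RGB channels from the image model
        if video_conv.bias is not None and image_conv.bias is not None:
            video_conv.bias.copy_(image_conv.bias)
```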
According to a third aspect of the embodiments of the present disclosure, there is provided a video object segmentation apparatus, including: a video acquisition unit configured to acquire a video to be processed; a mask acquisition unit configured to, for each image frame in the video to be processed, input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame; and a segmentation unit configured to perform target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed; wherein the video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.
Optionally, the mask acquisition unit is further configured to determine a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and to input the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to, for an image frame in the at least one adjacent image frame, acquire a target mask of the image frame; determine a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, take the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjust the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determine a second position whose value is less than or equal to a second threshold in the target mask of the image frame, take the position corresponding to the second position in the initial matrix as a position of guide background information, and adjust the value at the position of the guide background information to a second preset value; determine a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, take the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjust the value at the position of the invalid guide information to a third preset value; and take the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, the mask acquisition unit is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to a first threshold in the target mask after the erosion operation.
Optionally, the mask acquisition unit is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position whose value is less than or equal to a second threshold in the target mask after the dilation operation.
Optionally, the mask acquisition unit is further configured to normalize the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenate the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to input the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenate the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a training apparatus for a video object segmentation model, including: a sample set acquisition unit configured to acquire a training sample set, wherein the training sample set includes a plurality of training videos and an actual target mask for each image frame in each training video; the mask acquisition unit is configured to input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model for each image frame in the training video to obtain an estimated mask of the image frame; a training unit configured to train the video object segmentation model based on the estimated mask for each image frame in the training video and the actual object mask for each image frame in the training video.
Optionally, the mask acquisition unit is further configured to determine a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and to input the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to, for an image frame in the at least one adjacent image frame, acquire a target mask of the image frame; determine a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, take the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjust the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determine a second position whose value is less than or equal to a second threshold in the target mask of the image frame, take the position corresponding to the second position in the initial matrix as a position of guide background information, and adjust the value at the position of the guide background information to a second preset value; determine a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, take the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjust the value at the position of the invalid guide information to a third preset value; and take the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, the mask acquisition unit is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to a first threshold in the target mask after the erosion operation.
Optionally, the mask acquisition unit is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position whose value is less than or equal to a second threshold in the target mask after the dilation operation.
Optionally, the mask acquisition unit is further configured to normalize the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenate the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to input the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenate the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
Optionally, before the training, the parameters of the first three channels of the first convolutional layer of the video object segmentation model are set to the parameters of the first convolutional layer of an image object segmentation model, wherein the image object segmentation model is trained in advance.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video object segmentation method and/or the training method of the video object segmentation model according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the video object segmentation method and/or the training method of the video object segmentation model described above according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a video object segmentation method and/or a training method of a video object segmentation model according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the video target segmentation method and device and the training method of the video target segmentation model of the present disclosure, when the target mask of an image frame is obtained, the guide encoding matrices of the image frames adjacent to it in the video are introduced; that is, temporal information is effectively fused into the target segmentation. As a result, the stability of the video target segmentation result can be significantly improved and flicker reduced while adding essentially no computation or time consumption, yielding a more precise and stable segmentation result. The present disclosure therefore solves the problems of large computation amount and low convergence speed of deep-learning-based video target segmentation in the related art.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation scenario of a video object segmentation method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a method of video object segmentation in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of training a video object segmentation model in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating a video object segmentation apparatus in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a training apparatus for a video object segmentation model in accordance with an exemplary embodiment;
fig. 6 is a block diagram of an electronic device 600 according to an embodiment of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "includes at least one of A and B" covers the following three parallel cases: (1) includes A; (2) includes B; (3) includes A and B. For another example, "perform at least one of step one and step two" covers the following three parallel cases: (1) perform step one; (2) perform step two; (3) perform step one and step two.
In order to solve the above problems, the present disclosure provides a training method for a video target segmentation model and a video target segmentation method, which can solve the problems of large computation amount and low convergence rate in the related art, and the following description takes segmenting a face in a video as an example.
Fig. 1 is a schematic diagram illustrating an implementation scenario of a video object segmentation method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the implementation scenario includes a server 100, a user terminal 110, and a user terminal 120. The number of user terminals is not limited to two, and they include, but are not limited to, mobile phones, personal computers, and the like; a user terminal may be equipped with a camera for capturing video. The server may be a single server, a server cluster composed of several servers, a cloud computing platform, or a virtualization center.
After receiving a request for training the video object segmentation model sent by the user terminals 110 and 120, the server 100 collects videos historically received from the user terminals 110 and 120 and labels the faces in the collected videos to obtain target masks, with faces as the target, for each video; the labeled videos are combined together to serve as a training sample set, where the training sample set includes a plurality of training videos and the actual target mask of each image frame in each training video. After the server 100 obtains the training sample set, for each image frame in each training video, it inputs the target mask of at least one adjacent image frame adjacent to the image frame and the pixel matrix of the image frame into the video target segmentation model to obtain an estimated mask of the image frame, determines a target loss function based on the estimated mask of each image frame in the training video and the actual target mask of each image frame in the training video, adjusts the parameters of the video target segmentation model through the target loss function, and thereby trains the video target segmentation model. After training is complete, the target mask of any input video to be processed can be obtained through the trained video target segmentation model, so that the video to be processed is accurately segmented.
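As a hedged illustration of one such parameter update, the sketch below assumes a PyTorch model taking the current frame plus adjacent-frame guide encodings, and uses binary cross-entropy as the target loss function; the disclosure does not fix a specific loss, so all names here are illustrative:

```python
import torch.nn.functional as F

def train_step(model, optimizer, frame, guides, gt_mask):
    """One parameter update: predict the estimated mask of the frame from its
    pixel matrix and the adjacent-frame guide encodings, then compare it with
    the actual target mask via the target loss function."""
    optimizer.zero_grad()
    pred = model(frame, guides)          # estimated mask (logits) of the frame
    loss = F.binary_cross_entropy_with_logits(pred, gt_mask)
    loss.backward()                      # gradients of the target loss
    optimizer.step()                     # adjust the segmentation model parameters
    return loss.item()
```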
Hereinafter, a video object segmentation method and apparatus, a training method of a video object segmentation model according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 2 to 5.
Fig. 2 is a flowchart illustrating a video object segmentation method according to an exemplary embodiment, as shown in fig. 2, the video object segmentation method includes the following steps:
in step S201, a video to be processed is acquired. The video to be processed can be any video needing target segmentation.
In step S202, for each image frame in the video to be processed, a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame are input to the video target segmentation model to obtain a target mask of the image frame. For example, take one image frame (hereinafter referred to as the current image frame) as an example, and assume the current image frame has a pixel length of H and a pixel width of W. The target masks of the N image frames adjacent to the current image frame have been obtained, denoted M_k ∈ R^(H×W), k = 1, 2, …, N. In practice, a good segmentation result can be obtained when N is 1 or 2; of course, a different number of frames can be tried according to actual needs. An adjacent frame is an image frame before or after the current image frame in the video, and can be selected differently according to the application, for example the previous N frames of the current image frame, the next N frames of the current image frame, or several frames before and after the current image frame, together forming the N frames. Here, M_k(i, j) denotes the value in the i-th row and j-th column of the target mask of the k-th adjacent frame, with values in the range [0, 1]. Assuming the current image frame is in RGB form (BGR storage, etc.), the pixel matrix of the RGB image may be represented as X ∈ R^(H×W×3); in general, each element of X takes a value in the range [0, 255].
According to an exemplary embodiment of the present disclosure, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame to the video target segmentation model to obtain the target mask of the image frame includes: determining a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame. With this embodiment, the guide encoding matrix can be set flexibly according to different guidance objects. For example, the guide encoding matrix described above may be determined based on the target masks of the N frames adjacent to the current image frame, i.e., N groups of guide encoding matrices are obtained from M_k, k = 1, 2, …, N. After encoding the target masks of the N adjacent frames, the guide encoding matrices are used to guide the generation of the target mask of the current image frame, serving as reference information for computing it, which improves the temporal stability of the segmentation result and reduces flickering.
Specifically, depending on the guidance object, the guide encoding matrix may include at least one of the following kinds of information:
1) guide foreground information: through specific encoding, the target mask of the target object to be segmented is provided temporally to the current image frame, guiding the video target segmentation model forward to generate the target mask of the current image frame;
2) guide background information: through specific encoding, the target mask regions serving as background are provided temporally to the current image frame, guiding the video target segmentation model in reverse to generate the target mask of the current image frame;
3) invalid guide information: this information produces no guiding effect on the video target segmentation model; the video target segmentation model is then expected to play the role of an image target segmentation model. That is, it can be applied in the following scenarios: when there is no target mask for the adjacent frames, the video target segmentation model should still function properly, or the video target segmentation model should be able to segment both still images and video. It should be noted that, for the invalid guide information, a probability p may further be set in the video target segmentation model (the experimental default is 0.3, adjustable according to the specific situation), representing that when the image frames of the N frames adjacent to the current image frame are encoded, there is a probability p (e.g., 30%) that the guide encoding matrix G_k is entirely encoded as invalid guidance, i.e., the guide encoding matrix contains only invalid guide information, thereby improving the robustness of the model.
The three kinds of information are independent of one another and can be combined according to specific requirements; common combinations are: guide foreground + invalid guidance, and guide foreground + guide background.
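As a hedged illustration of the probability-p mechanism above, the following sketch (assuming numpy and the single-channel encoding described below, with invalid guidance encoded as the value b) replaces a guide encoding matrix entirely with invalid guidance with probability p:

```python
import numpy as np

def maybe_drop_guidance(guide: np.ndarray, p: float = 0.3, b: float = 0.0) -> np.ndarray:
    """With probability p, replace the whole guide encoding matrix with
    invalid guidance (all elements set to b), so the model also learns
    to work without temporal hints."""
    if np.random.rand() < p:
        return np.full_like(guide, b)
    return guide
```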
According to an exemplary embodiment of the present disclosure, determining the guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame includes: for an image frame in the at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, taking the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjusting the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determining a second position whose value is less than or equal to a second threshold in the target mask of the image frame, taking the position corresponding to the second position in the initial matrix as a position of guide background information, and adjusting the value at the position of the guide background information to a second preset value; determining a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, taking the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjusting the value at the position of the invalid guide information to a third preset value; and taking the adjusted initial matrix as the guide encoding matrix of the image frame. With this embodiment, at least one adjacent image frame can be conveniently and quickly converted into a guide encoding matrix to guide the target segmentation of the current image frame.
For example, for each image frame of the at least one adjacent image frame, the value at a position of guide foreground information in the guide encoding matrix of the image frame is b+σ, the value at a position of guide background information is b−σ, and the value at a position of invalid guide information is b, where a position of guide foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, a position of guide background information corresponds to a position whose value is less than or equal to the second threshold in the target mask of the image frame, a position of invalid guide information corresponds to a position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, and b and σ are preset values. The guide encoding matrix determined in this way allows the required information to be obtained conveniently and quickly.
Specifically, denote by G_k ∈ R^(H×W), k = 1, 2, …, N the guide encoding matrix of the k-th adjacent frame for the current image frame X. In the first encoding mode, each element of G_k takes a value in the set {b−σ, b, b+σ}: the value b+σ encodes guide foreground information, b−σ encodes guide background information, and b encodes invalid guide information. Specifically, the positions predicted as foreground in M_k (i.e., the positions whose value is greater than or equal to the first threshold in the target mask of the image frame) can be extracted and the same positions in G_k set to b+σ; the positions predicted as background in M_k (i.e., the positions whose value is less than or equal to the second threshold) can be extracted and the same positions in G_k set to b−σ; and where neither foreground nor background guidance is needed, the positions in M_k whose value is greater than the second threshold and less than the first threshold can be extracted and the same positions in G_k set to b. b may be 0 and σ may be 1.
For another example, for each image frame of the at least one adjacent image frame, the value at a position of guide foreground information in the guide encoding matrix of the image frame is [0, 1], the value at a position of guide background information is [1, 0], and the value at a position of invalid guide information is [0, 0], where a position of guide foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, a position of guide background information corresponds to a position whose value is less than or equal to the second threshold in the target mask of the image frame, and a position of invalid guide information corresponds to a position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame. The guide encoding matrix determined in this way allows the required information to be obtained conveniently and quickly.
Specifically, denote by G_k ∈ R^(H×W×2), k = 1, 2, …, N the guide encoding matrix of the k-th adjacent frame for the current image frame X. In the second encoding mode, each element takes a value in the set {b, b+σ}, and each position (i, j) holds two elements: if the two elements take the value [0, 0], the position encodes invalid guide information; if [0, 1], it encodes guide foreground information; if [1, 0], it encodes guide background information. Specifically, the positions predicted as foreground in M_k (i.e., the positions whose value is greater than or equal to the first threshold in the target mask of the image frame) can be extracted and the same positions in G_k set to [0, 1]; the positions predicted as background in M_k (i.e., the positions whose value is less than or equal to the second threshold) can be extracted and the same positions in G_k set to [1, 0]; and where neither foreground nor background guidance is needed, the positions in M_k whose value is greater than the second threshold and less than the first threshold can be extracted and the same positions in G_k set to [0, 0]. b may be 0 and σ may be 1.
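A corresponding sketch of the second (two-channel) encoding mode, under the same illustrative assumptions:

```python
import numpy as np

def encode_guide_v2(mask: np.ndarray, t_fg: float = 0.7, t_bg: float = 0.3) -> np.ndarray:
    """Second encoding mode: two channels per position, where [0, 1] marks guide
    foreground, [1, 0] guide background, and [0, 0] invalid guidance.

    mask: target mask M_k of an adjacent frame, shape (H, W), values in [0, 1].
    Returns an array of shape (H, W, 2).
    """
    h, w = mask.shape
    guide = np.zeros((h, w, 2), dtype=np.float32)  # invalid guidance everywhere
    guide[mask >= t_fg, 1] = 1.0                   # foreground channel
    guide[mask <= t_bg, 0] = 1.0                   # background channel
    return guide
```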
According to an exemplary embodiment of the present disclosure, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation. With this embodiment, a better guide encoding matrix can be obtained.
According to an exemplary embodiment of the present disclosure, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation. With this embodiment, a better guide encoding matrix can be obtained.
For example, if the guide encoding matrix includes guide foreground information, guide background information, and invalid guide information, one feasible way is to perform an image erosion operation on M_k and set the positions predicted as foreground in the eroded M_k to b+σ; perform an image dilation operation on M_k and set the positions predicted as background in the dilated M_k to b−σ; and set the remaining regions of M_k to b.
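A possible realization of this erosion/dilation variant, sketched with OpenCV; the kernel size and the thresholds are assumptions:

```python
import cv2
import numpy as np

def encode_guide_morph(mask: np.ndarray, t_fg: float = 0.7, t_bg: float = 0.3,
                       b: float = 0.0, sigma: float = 1.0, ksize: int = 5) -> np.ndarray:
    """Erode before extracting foreground and dilate before extracting background,
    so uncertain boundary pixels fall back to invalid guidance."""
    kernel = np.ones((ksize, ksize), np.uint8)
    eroded = cv2.erode(mask.astype(np.float32), kernel)    # shrinks the foreground region
    dilated = cv2.dilate(mask.astype(np.float32), kernel)  # grows the foreground region
    guide = np.full(mask.shape, b, dtype=np.float32)
    guide[eroded >= t_fg] = b + sigma                      # confident foreground only
    guide[dilated <= t_bg] = b - sigma                     # confident background only
    return guide
```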
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: normalizing the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenating the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame. This embodiment enables convenient and fast fusion.
Specifically, after the guide encoding matrices G_k, k = 1, 2, …, N of the N adjacent frames are obtained, the RGB pixel matrix X of the current image frame and G_k, k = 1, 2, …, N may be fused. There are various fusion modes, and the disclosure does not limit them. For example, a simple and feasible way is to normalize X in the preprocessing performed before input to the video object segmentation model and directly concatenate it with G_k, k = 1, 2, …, N in the channel dimension, so that the final number of input channels of the video object segmentation model increases from 3 to 3+N in the first encoding mode and from 3 to 3+2N in the second encoding mode. The normalization may map the pixel values of X to [b−σ, b+σ] (generally b = 0 and σ = 1 are taken, i.e., the pixel values of X are normalized to [−1, 1]).
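A hedged sketch of this channel-dimension fusion, assuming numpy and an 8-bit RGB frame:

```python
import numpy as np

def fuse_inputs(frame_rgb: np.ndarray, guides: list, b: float = 0.0,
                sigma: float = 1.0) -> np.ndarray:
    """Normalize the RGB frame to [b - sigma, b + sigma] and concatenate the
    guide encoding matrices of the N adjacent frames in the channel dimension.

    frame_rgb: (H, W, 3) uint8 pixel matrix X.
    guides: N arrays of shape (H, W) (first mode) or (H, W, 2) (second mode).
    Returns an (H, W, 3 + N) or (H, W, 3 + 2N) float32 input tensor.
    """
    x = frame_rgb.astype(np.float32) / 255.0  # to [0, 1]
    x = x * (2.0 * sigma) + (b - sigma)       # to [b - sigma, b + sigma]
    guides = [g[..., None] if g.ndim == 2 else g for g in guides]
    return np.concatenate([x] + guides, axis=-1)
```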
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: inputting the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image object segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenating the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Specifically, G_k, k = 1, 2, …, N may first be fed into several convolutional layers to obtain an output result; the pixel matrix X is input to the image object segmentation network (i.e., a model whose input is a 3-channel image) to obtain an intermediate output result; the output result and the intermediate output result are then concatenated in the channel dimension and fed into the subsequent layers of the video object segmentation model.
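A hedged PyTorch sketch of this variant; splitting the backbone into a stem and a tail, the layer widths, and the downsampling factor of the guide branch (chosen so the two branches reach the same resolution) are all assumptions:

```python
import torch
import torch.nn as nn

class GuidedSegmentationNet(nn.Module):
    """Runs the guide encodings through a few conv layers, runs the image
    stem on the RGB frame, and fuses the two feature maps at equal resolution."""

    def __init__(self, image_stem: nn.Module, tail: nn.Module,
                 guide_channels: int, stem_channels: int):
        super().__init__()
        self.image_stem = image_stem              # front part of an image segmentation model
        self.guide_conv = nn.Sequential(          # brings guides to the stem's resolution
            nn.Conv2d(guide_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(stem_channels + 16, stem_channels, 1)
        self.tail = tail                          # remaining segmentation layers

    def forward(self, frame: torch.Tensor, guides: torch.Tensor) -> torch.Tensor:
        feats = self.image_stem(frame)            # preset intermediate output result
        g = self.guide_conv(guides)               # processed guide encoding matrix
        fused = torch.cat([feats, g], dim=1)      # channel-dimension concatenation
        return self.tail(self.fuse(fused))        # target mask of the image frame
```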
In step S203, target segmentation is performed on the video to be processed based on the target mask of each image frame in the video to be processed. The video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.
Fig. 3 is a flowchart illustrating a training method of a video object segmentation model according to an exemplary embodiment, where as shown in fig. 3, the training method of the video object segmentation model includes the following steps:
In step S301, a training sample set is obtained, where the training sample set includes a plurality of training videos and the actual target mask of each image frame in each training video. For example, videos historically received from the user terminals 110 and 120 may be collected as training videos, the predetermined objects in the collected training videos may be labeled to obtain target masks targeting the predetermined objects in each training video, and the labeled videos may be combined together to serve as the training sample set, where a predetermined object is the target object to be segmented according to actual needs. For another example, since labeled video is a scarce resource and annotating video is time-consuming and labor-intensive, a labeled still picture can be used to simulate a two- or three-frame video: the still picture and its labeled target mask are subjected to the same random augmentations through data enhancement methods such as translation, distortion, rotation, radial transformation, thin-plate spline interpolation, and the like, so as to simulate motion, jitter, blur, and similar behavior of objects in video, thereby obtaining the training videos required for training and the corresponding labeled target masks.
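Below is a minimal sketch of simulating a short clip from one labeled still image, using random rotation and translation as stand-ins for the augmentations listed above; OpenCV and all parameter ranges are assumptions:

```python
import cv2
import numpy as np

def simulate_clip(image: np.ndarray, mask: np.ndarray, n_frames: int = 3,
                  max_shift: float = 0.03, max_angle: float = 5.0):
    """Apply the same random warp to a still image and its labeled target mask
    to mimic object motion and jitter across a few video frames."""
    h, w = mask.shape
    frames, masks = [image], [mask]
    for _ in range(n_frames - 1):
        angle = np.random.uniform(-max_angle, max_angle)
        tx = np.random.uniform(-max_shift, max_shift) * w
        ty = np.random.uniform(-max_shift, max_shift) * h
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        m[:, 2] += np.array([tx, ty])          # add a small random translation
        frames.append(cv2.warpAffine(image, m, (w, h)))
        masks.append(cv2.warpAffine(mask, m, (w, h)))
    return frames, masks
```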
In step S302, for each image frame in the training video, a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame are input to the video target segmentation model, so as to obtain an estimated mask of the image frame. For example, taking one image frame (hereinafter referred to as the current image frame) as an example, assume that the current image frame has a pixel height of H and a pixel width of W, and that a target mask has already been obtained for each of the N image frames adjacent to the current image frame, denoted M_k ∈ R^(H×W), k = 1, …, N. When N is 1 or 2, a good segmentation result can already be obtained in practical applications; of course, different numbers of frames can be tried according to actual needs. An adjacent frame refers to an image frame before or after the current image frame in the training video, and different selections can be made according to the actual application: for example, the previous N frames of the current image frame, the next N frames, or several frames before and after the current image frame may be selected to form the N frames. Here, M_k(i, j) denotes the value of the i-th row and j-th column of the target mask of the k-th adjacent image frame, with values in the range [0, 1]. Assuming the current image frame is in RGB form (or BGR storage, etc.), its pixel matrix may be represented as X ∈ R^(H×W×3); in general, each element of X takes a value in the range [0, 255].
According to an exemplary embodiment of the present disclosure, inputting a target mask of at least one adjacent image frame adjacent to an image frame and a pixel matrix of the image frame into a video target segmentation model to obtain an estimated mask of the image frame includes: determining a guiding encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guiding encoding matrix of the at least one adjacent image frame comprises at least one of: guiding foreground information, guiding background information, and invalid guiding information, where the guiding foreground information forward-guides the target mask of the image frame output by the video target segmentation model, the guiding background information backward-guides the target mask of the image frame output by the video target segmentation model, and the invalid guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guiding encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame. With this embodiment, the guiding encoding matrix can be flexibly set according to different guidance purposes. For example, the guiding encoding matrices may be determined based on the target masks of the N image frames adjacent to the current image frame, i.e., N groups of guiding encoding matrices are obtained from M_k, k = 1, …, N. The guiding encoding matrices encode the target masks of the N adjacent image frames and then guide the generation of the target mask of the current image frame, serving as reference information for calculating the target mask of the current image frame; this improves the temporal stability of the segmentation results and reduces flickering and the like.
Specifically, the guidance coding matrix may include at least one of the following information according to different guidance objects:
1) Guiding foreground information: through specific encoding, the temporal target mask of the target object to be segmented is provided to the current image frame, forward-guiding the video target segmentation model to generate the target mask of the current image frame;
2) Guiding background information: through specific encoding, the temporal target mask of regions serving as background is provided to the current image frame, backward-guiding the video target segmentation model to generate the target mask of the current image frame;
3) Invalid guiding information: this information produces no guiding effect on the video target segmentation model; instead, the video target segmentation model is expected to behave like an image target segmentation model. This applies in the following scenarios: when no target mask is available for the adjacent frames, the video target segmentation model should still function properly, or the model should be able to segment both still images and video. It should be noted that, for invalid guiding information, a probability p may be set when training the video target segmentation model (the experimental default is 0.3, adjustable according to the specific situation): with probability p, when the adjacent N image frames are encoded, the guiding encoding matrices E_k, k = 1, …, N, are entirely encoded as invalid guiding encoding, i.e., the guiding encoding matrices contain only invalid guiding information, thereby improving the robustness of the model (see the sketch after this list).
The three kinds of information are independent and can be combined in different ways according to specific requirements; common combinations include: guiding foreground + invalid guidance, and guiding foreground + guiding background.
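For the probability-p invalid guidance described in item 3) above, a minimal sketch follows; the function and variable names are hypothetical, and it assumes the guiding encoding matrices have already been built, with b encoding invalid guidance:

```python
# With probability p, replace all guiding encodings with invalid guidance
# so the video model also learns to behave like an image model.
import numpy as np

def maybe_drop_guidance(guides, p=0.3, b=0.0):
    """guides: list of (H, W) guiding encoding matrices for the N frames."""
    if np.random.rand() < p:
        return [np.full_like(g, b) for g in guides]  # only invalid guidance
    return guides
```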
According to an exemplary embodiment of the present disclosure, determining a guided encoding matrix of at least one adjacent image frame based on a target mask of the at least one adjacent image frame adjacent to the image frame includes: for an image frame in at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position with a value larger than or equal to a first threshold value in a target mask of an image frame, taking a position corresponding to the first position in an initial matrix as a position for guiding foreground information, and adjusting the value of the position for guiding the foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame; determining a second position with a value less than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of the guiding background information, and adjusting the value of the position of the guiding background information to a second preset value; determining a third position of which the value is greater than the second threshold value and less than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position for guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value; and taking the adjusted initial matrix as a guide coding matrix of the image frame. According to the embodiment, at least one adjacent image frame can be conveniently and quickly converted into the guide coding matrix so as to guide the target segmentation of the current image frame.
For example, for each image frame of the at least one adjacent image frame, the value representing the position of guiding foreground information in the guiding encoding matrix of the image frame is b + σ, the value representing the position of guiding background information is b − σ, and the value representing the position of invalid guiding information is b, where the position of guiding foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, the position of guiding background information corresponds to a position whose value is less than or equal to the second threshold in the target mask of the image frame, the position of invalid guiding information corresponds to a position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, and b and σ are positive integers. The guiding encoding matrix determined by this embodiment allows the required information to be acquired conveniently and quickly.
Specifically, denote by E_k ∈ R^(H×W), k = 1, …, N, the guiding encoding matrix of the k-th adjacent frame for the current image frame X. In the first encoding mode, E_k takes values in the set {b − σ, b, b + σ}, where the value b + σ encodes guiding foreground information, b − σ encodes guiding background information, and b encodes invalid guiding information. Specifically, the positions in M_k predicted as foreground (i.e., positions whose value in the target mask of the image frame is greater than or equal to the first threshold) may be extracted and the same positions in E_k set to b + σ; the positions in M_k predicted as background (i.e., positions whose value is less than or equal to the second threshold) may be extracted and the same positions in E_k set to b − σ; and if it is not necessary to guide the foreground or background, the positions in M_k whose value is greater than the second threshold and less than the first threshold may be extracted and the same positions in E_k set to b. b may be 0 and σ may be 1.
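A minimal sketch of this first encoding mode follows; the threshold values 0.3 and 0.7 are illustrative assumptions, since the patent only names a first and a second threshold:

```python
# First encoding mode: single-channel guiding encoding matrix with values
# in {b - sigma, b, b + sigma}.
import numpy as np

def encode_mask_mode1(mask, t_low=0.3, t_high=0.7, b=0.0, sigma=1.0):
    """mask: (H, W) adjacent-frame target mask with values in [0, 1]."""
    guide = np.full(mask.shape, b, dtype=np.float32)  # b: invalid guidance
    guide[mask >= t_high] = b + sigma                 # guiding foreground
    guide[mask <= t_low] = b - sigma                  # guiding background
    return guide
```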
For another example, for each image frame of the at least one adjacent image frame, the value representing the position of guiding foreground information in the guiding encoding matrix of the image frame is [0, 1], the value representing the position of guiding background information is [1, 0], and the value representing the position of invalid guiding information is [0, 0], where the position of guiding foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, the position of guiding background information corresponds to a position whose value is less than or equal to the second threshold, and the position of invalid guiding information corresponds to a position whose value is greater than the second threshold and less than the first threshold. The guiding encoding matrix determined by this embodiment likewise allows the required information to be acquired conveniently and quickly.
Specifically, denote by E_k ∈ R^(H×W×2), k = 1, …, N, the guiding encoding matrix of the k-th adjacent frame for the current image frame X. In the second encoding mode, each position of E_k holds two elements taking values in {b, b + σ}. For the i-th row and j-th column, if the two elements take the value [0, 0], the position encodes invalid guiding information; if they take the value [0, 1], the position encodes guiding foreground information; and if they take the value [1, 0], the position encodes guiding background information. Specifically, the positions in M_k predicted as foreground (i.e., positions whose value is greater than or equal to the first threshold) may be extracted and the same positions in E_k set to [0, 1]; the positions in M_k predicted as background (i.e., positions whose value is less than or equal to the second threshold) may be extracted and the same positions in E_k set to [1, 0]; and if it is not necessary to guide the foreground or background, the positions in M_k whose value is greater than the second threshold and less than the first threshold may be extracted and the same positions in E_k set to [0, 0]. b may be 0 and σ may be 1.
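A minimal sketch of the second encoding mode, under the same assumed thresholds and with b = 0, σ = 1, so the element values are {0, 1}:

```python
# Second encoding mode: two-channel one-hot style guiding encoding,
# [0, 1] for foreground, [1, 0] for background, [0, 0] otherwise.
import numpy as np

def encode_mask_mode2(mask, t_low=0.3, t_high=0.7):
    """mask: (H, W) adjacent-frame target mask with values in [0, 1]."""
    guide = np.zeros(mask.shape + (2,), dtype=np.float32)  # [0, 0]: invalid
    guide[mask >= t_high, 1] = 1.0                         # [0, 1]: foreground
    guide[mask <= t_low, 0] = 1.0                          # [1, 0]: background
    return guide
```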
According to an exemplary embodiment of the present disclosure, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation. With this embodiment, a better guiding encoding matrix can be obtained.
According to an exemplary embodiment of the present disclosure, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation. With this embodiment, a better guiding encoding matrix can be obtained.
For example, if the guiding encoding matrix includes guiding foreground information, guiding background information, and invalid guiding information, one feasible way is to perform an image erosion operation on M_k, extract the positions predicted as foreground in the eroded M_k, and set those positions to b + σ; perform an image dilation operation on M_k, extract the positions predicted as background in the dilated M_k, and set those positions to b − σ; and set the remaining region to b.
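A minimal sketch combining the erosion and dilation refinements with the first encoding mode; the kernel size and thresholds are illustrative assumptions:

```python
# Erode before taking confident foreground positions; dilate before taking
# confident background positions (morphology shrinks/grows the mask).
import numpy as np
import cv2

def encode_with_morphology(mask, b=0.0, sigma=1.0, ksize=5):
    """mask: (H, W) float32 adjacent-frame target mask in [0, 1]."""
    kernel = np.ones((ksize, ksize), np.uint8)
    eroded = cv2.erode(mask, kernel)    # min filter: shrinks the foreground
    dilated = cv2.dilate(mask, kernel)  # max filter: grows the foreground
    guide = np.full(mask.shape, b, dtype=np.float32)
    guide[eroded >= 0.7] = b + sigma    # confident foreground after erosion
    guide[dilated <= 0.3] = b - sigma   # confident background after dilation
    return guide
```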
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guiding encoding matrix of at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame includes: performing normalization processing on the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; connecting the processed pixel matrix of the image frame and the guiding encoding matrix of the at least one adjacent image frame in the channel number dimension to obtain a fused matrix; and inputting the fused matrix into the video target segmentation model to obtain the target mask of the image frame. This embodiment enables convenient and fast fusion.
Specifically, after the guiding encoding matrices E_k, k = 1, …, N, of the adjacent N frames are obtained, the RGB pixel matrix X of the current image frame may be fused with E_k, k = 1, …, N. There are various fusion methods, and the present disclosure does not limit this. For example, a simple and feasible way is to normalize X and, before inputting to the video target segmentation model, directly connect (concatenate) X with E_k, k = 1, …, N, in the channel number dimension (channel), so that the final number of input channels of the video target segmentation model increases from 3 to 3 + N in the first encoding mode, and from 3 to 3 + 2 × N in the second encoding mode. The normalization may scale the pixel values of X to [b − σ, b + σ] (in general, b = 0 and σ = 1 are taken, i.e., the pixel values of X are normalized to [−1, 1]).
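A minimal sketch of this concatenation fusion for the first encoding mode, assuming a PyTorch (C, H, W) tensor layout:

```python
# Normalize the RGB frame to [-1, 1] and concatenate the N single-channel
# guiding encodings along the channel dimension -> (3 + N, H, W) input.
import torch

def fuse_inputs(x_rgb, guides):
    """x_rgb: (3, H, W) uint8 tensor; guides: list of N (H, W) float tensors."""
    x = x_rgb.float() / 127.5 - 1.0                    # [0, 255] -> [-1, 1]
    g = torch.stack([g.float() for g in guides], dim=0)  # (N, H, W)
    return torch.cat([x, g], dim=0)                    # (3 + N, H, W)
```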
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guiding encoding matrix of at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame includes: inputting the guiding encoding matrix of the at least one adjacent image frame into a predetermined number of convolution layers to obtain a processed guiding encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, where the preset intermediate output result has the same resolution as the processed guiding encoding matrix; connecting the preset intermediate output result with the processed guiding encoding matrix in the channel number dimension to obtain a fused matrix; and inputting the fused matrix into the video target segmentation model to obtain the target mask of the image frame. With this embodiment, a better fusion result can be obtained.
Specifically, the guiding encoding matrices E_k, k = 1, …, N, may first be sent to several convolution layers to obtain an output result; the pixel matrix X is then input to the image target segmentation network (i.e., the model takes a 3-channel image as input) to obtain an intermediate output result; the output result and the intermediate output result are then connected (concatenated) in the channel number dimension (channel), and the connected result is sent to the subsequent part of the video target segmentation model.
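The following is a minimal, heavily assumed sketch of this convolution-based fusion; the patent does not specify layer counts, channel widths, or the tap point in the image network, so all of those are illustrative:

```python
# Pass the stacked guiding encodings through a few conv layers, downsample
# them to the resolution of an intermediate feature map of the image
# network, then concatenate and merge along the channel dimension.
import torch
import torch.nn as nn

class GuideFusion(nn.Module):
    def __init__(self, n_guides=2, feat_ch=64):
        super().__init__()
        # Two stride-2 convs bring (N, H, W) guidance down to (16, H/4, W/4).
        self.guide_conv = nn.Sequential(
            nn.Conv2d(n_guides, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.merge = nn.Conv2d(feat_ch + 16, feat_ch, 1)  # 1x1 merge conv

    def forward(self, feat, guides):
        """feat: (B, feat_ch, H/4, W/4) intermediate feature; guides: (B, N, H, W)."""
        g = self.guide_conv(guides)                 # (B, 16, H/4, W/4)
        return self.merge(torch.cat([feat, g], 1))  # fused features
```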
Returning to fig. 3, in step S303, the video object segmentation model is trained based on the estimated mask of each image frame in the training video and the actual object mask of each image frame in the training video. In step S303, a target loss function may be determined based on the estimated mask of each image frame in the training video and the labeled target mask corresponding to the training video, and then parameters of the video target segmentation model may be adjusted through target loss minimization to perform training.
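A minimal sketch of one such parameter-adjustment step, assuming binary cross-entropy as the target loss (the patent does not fix a particular loss function):

```python
# One training step: predict the mask, compute the target loss against the
# labeled mask, and adjust the model parameters by gradient descent.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, fused_input, gt_mask):
    """fused_input: (B, C, H, W); gt_mask: (B, 1, H, W) float in [0, 1]."""
    optimizer.zero_grad()
    pred = model(fused_input)  # (B, 1, H, W) mask logits
    loss = F.binary_cross_entropy_with_logits(pred, gt_mask)
    loss.backward()
    optimizer.step()           # adjust parameters to minimize the target loss
    return loss.item()
```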
According to an exemplary embodiment of the present disclosure, before training, the parameters of the first three channels of the first convolutional layer of the video target segmentation model are set to the parameters of the first convolutional layer of the image target segmentation model, where the image target segmentation model is trained in advance. With this embodiment, the parameters of the first convolutional layer of the pre-trained image target segmentation model can be directly copied into the first three channels of the first convolutional layer of the video target segmentation model before training, so that training starts from good parameters and a better training result can be obtained.
Specifically, suppose an image target segmentation model already exists (i.e., a model that does not fuse the guiding encoding matrices E_k, k = 1, …, N, so its number of input channels is still 3). In order to speed up convergence when training the video target segmentation model and to retain the excellent performance of the image target segmentation model when invalid guiding information is used, the parameters of the image target segmentation model may be used to initialize the video target segmentation model: for the first convolutional layer, the parameters of the first convolutional layer of the image target segmentation model are simply copied to the positions of the first three channels of the first convolutional layer of the video target segmentation model.
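A minimal sketch of this initialization in PyTorch, assuming both first convolutional layers share the same output channel count and kernel size:

```python
# Copy the pre-trained image model's first conv weights into the first three
# input channels of the video model's first conv; the extra guidance
# channels keep their fresh initialization.
import torch

@torch.no_grad()
def init_first_conv(video_conv, image_conv):
    """Both args are nn.Conv2d; video_conv has 3 + extra input channels."""
    video_conv.weight[:, :3] = image_conv.weight  # RGB channel weights
    if image_conv.bias is not None:
        video_conv.bias.copy_(image_conv.bias)
```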
According to an exemplary embodiment of the present disclosure, the image target segmentation model is trained by: acquiring a second training sample set, wherein the second training sample set includes a plurality of training images and labeled target masks corresponding to the plurality of training images; inputting a training image into the image target segmentation model to obtain an estimated mask of the training image; determining a target loss function based on the estimated mask and the labeled target mask corresponding to the training image; and adjusting parameters of the image target segmentation model through the target loss function to complete the training of the image target segmentation model.
In summary, to address stability and related problems in video target segmentation, the present disclosure provides a video target segmentation method based on adjacent frames: after the target masks of the N adjacent frames (N being an integer greater than 0) of the current image frame in a video are obtained and guide-encoded, they are fused with the RGB picture of the current image frame (a picture that may be stored in RGB, BGR, or a similar manner) and input into the video target segmentation model to calculate the target mask of the current image frame; target segmentation is then performed on the video based on the target mask of each image frame. In this way, the temporal information in the video is effectively fused, and, with essentially no increase in computation or time consumption, the stability of video target segmentation results is significantly improved, flickering is markedly reduced, and finer and more stable segmentation results are obtained. The method has very wide applicability and can be adapted to each subdivision direction of video processing.
Fig. 4 is a block diagram illustrating a video object segmentation apparatus in accordance with an exemplary embodiment. Referring to fig. 4, the apparatus includes a video acquisition unit 40, a mask acquisition unit 42, and a segmentation unit 44.
A video acquisition unit 40 configured to acquire a video to be processed; a mask acquisition unit 42 configured to, for each image frame in the video to be processed, input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain a target mask of the image frame; and a segmentation unit 44 configured to perform target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed; wherein the video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained through the video target segmentation model for each image frame in the training video sample and the actual target mask of each image frame in the training video sample.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 42 is further configured to determine a leading encoding matrix of at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each leading encoding matrix of the at least one adjacent image frame comprises at least one of: the method comprises the following steps of guiding foreground information, guiding background information and invalid guiding information, wherein the guiding foreground information is used for forward guiding a target mask of an image frame output by a video target segmentation model, the guiding background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide coding matrix of at least one adjacent image frame into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 42 is further configured to acquire, for an image frame of the at least one adjacent image frame, a target mask of the image frame; determining a first position with a value larger than or equal to a first threshold value in a target mask of an image frame, taking a position corresponding to the first position in an initial matrix as a position for guiding foreground information, and adjusting the value of the position for guiding the foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame; determining a second position with a value less than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of the guiding background information, and adjusting the value of the position of the guiding background information to a second preset value; determining a third position of which the value is greater than the second threshold value and less than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position for guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value; and taking the adjusted initial matrix as a guide coding matrix of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 42 is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 42 is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position in the target mask after the dilation operation, where the value is less than or equal to a second threshold.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 42 is further configured to perform normalization processing on the pixel matrix of the image frame, so as to obtain a pixel matrix after the image frame is processed; connecting the pixel matrix after image frame processing and a guide coding matrix of at least one adjacent image frame in a channel number dimension to obtain a fused matrix; and inputting the fused matrix into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 42 is further configured to input the guiding encoding matrix of at least one adjacent image frame into a predetermined number of convolution layers to obtain a processed guiding encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, where the preset intermediate output result has the same resolution as the processed guiding encoding matrix; connect the preset intermediate output result with the processed guiding encoding matrix in the channel number dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
FIG. 5 is a block diagram illustrating a training apparatus for a video object segmentation model according to an example embodiment. Referring to fig. 5, the apparatus includes a sample set acquisition unit 50, a mask acquisition unit 52, and a training unit 54.
A sample set acquiring unit 50 configured to acquire a training sample set, wherein the training sample set includes a plurality of training videos and an actual target mask for each image frame in each training video; the mask acquiring unit 52 is configured to, for each image frame in the training video, input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain an estimated mask of the image frame; a training unit 54 configured to train the video object segmentation model based on the estimated mask for each image frame in the training video and the actual object mask for each image frame in the training video.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 52 is further configured to determine a leading encoding matrix of at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each leading encoding matrix of the at least one adjacent image frame comprises at least one of: the method comprises the following steps of guiding foreground information, guiding background information and invalid guiding information, wherein the guiding foreground information is used for forward guiding a target mask of an image frame output by a video target segmentation model, the guiding background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide coding matrix of at least one adjacent image frame into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 52 is further configured to acquire, for an image frame of the at least one adjacent image frame, a target mask of the image frame; determining a first position with a value larger than or equal to a first threshold value in a target mask of an image frame, taking a position corresponding to the first position in an initial matrix as a position for guiding foreground information, and adjusting the value of the position for guiding the foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame; determining a second position with a value less than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of the guiding background information, and adjusting the value of the position of the guiding background information to a second preset value; determining a third position of which the value is greater than the second threshold value and less than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position for guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value; and taking the adjusted initial matrix as a guide coding matrix of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 52 is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 52 is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position in the target mask after the dilation operation, where the value is less than or equal to a second threshold.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 52 is further configured to perform normalization processing on the pixel matrix of the image frame, so as to obtain a pixel matrix after the image frame processing; connecting the pixel matrix after image frame processing and a guide coding matrix of at least one adjacent image frame in a channel number dimension to obtain a fused matrix; and inputting the fused matrix into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 52 is further configured to input the guiding encoding matrix of at least one adjacent image frame into a predetermined number of convolution layers to obtain a processed guiding encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, where the preset intermediate output result has the same resolution as the processed guiding encoding matrix; connect the preset intermediate output result with the processed guiding encoding matrix in the channel number dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
According to an exemplary embodiment of the present disclosure, before training, parameters of the first three channels of the first layer convolutional layer of the video object segmentation model are set as parameters of the first layer convolutional layer of the image object segmentation model, wherein the image object segmentation model is trained in advance.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 6 is a block diagram of an electronic device 600 including at least one memory 601 and at least one processor 602, the at least one memory having a set of computer-executable instructions stored therein that, when executed by the at least one processor, perform a method of training a video object segmentation model and a method of video object segmentation according to embodiments of the present disclosure.
By way of example, the electronic device 600 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device 600 need not be a single electronic device and can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 600 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 600, the processor 602 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 602 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The processor 602 may execute instructions or code stored in memory, where the memory 601 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 601 may be integrated with the processor 602, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 601 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 601 and the processor 602 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 602 can read files stored in the memory 601.
Further, the electronic device 600 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein, when executed by at least one processor, instructions in the computer-readable storage medium cause the at least one processor to perform the training method of the video target segmentation model and the video target segmentation method of the embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, a computer program product is provided, which includes computer instructions, and the computer instructions, when executed by a processor, implement a training method of a video object segmentation model and a video object segmentation method of the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for segmenting video objects, comprising:
acquiring a video to be processed;
for each image frame in the video to be processed, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame;
performing target segmentation on the video to be processed based on a target mask of each image frame in the video to be processed;
wherein the video target segmentation model is trained by: adjusting parameters of the video target segmentation model based on an estimated mask obtained through the video target segmentation model for each image frame in a training video sample and an actual target mask of each image frame in the training video sample.
2. The video object segmentation method of claim 1, wherein the inputting the object mask of at least one adjacent image frame adjacent to the image frame and the pixel matrix of the image frame into a video object segmentation model to obtain the object mask of the image frame comprises:
determining a guided encoding matrix for at least one adjacent image frame adjacent to the image frame based on a target mask for the at least one adjacent image frame, wherein each of the guided encoding matrices for the at least one adjacent image frame comprises at least one of: the guiding foreground information is used for forward guiding the video target segmentation model to output a target mask of the image frame, the guiding background information is used for backward guiding the video target segmentation model to output the target mask of the image frame, and the ineffective guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model;
and inputting the pixel matrix of the image frame and the guide coding matrix of at least one adjacent image frame into the video target segmentation model to obtain a target mask of the image frame.
3. The video object segmentation method of claim 2 wherein said determining a guided coding matrix for at least one adjacent image frame based on an object mask for the at least one adjacent image frame adjacent to the image frame comprises:
for an image frame of the at least one adjacent image frame,
acquiring a target mask of the image frame;
determining a first position with a value larger than or equal to a first threshold value in the target mask of the image frame, taking a position corresponding to the first position in an initial matrix as a position of guiding foreground information, and adjusting the value of the position of the guiding foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame;
determining a second position with a value smaller than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of guiding background information, and adjusting the value of the position of the guiding background information to a second preset value;
determining a third position with a value larger than the second threshold value and smaller than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position of guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value;
and taking the adjusted initial matrix as a guide coding matrix of the image frame.
4. The method of claim 3, wherein the determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame comprises:
performing an erosion operation on the target mask of the image frame,
and determining a first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
5. A training method of a video object segmentation model is characterized by comprising the following steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of training videos and an actual target mask of each image frame in each training video;
for each image frame in the training video, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain an estimated mask of the image frame;
and training the video target segmentation model based on the estimated mask of each image frame in the training video and the actual target mask of each image frame in the training video.
6. A video object segmentation apparatus, comprising:
a video acquisition unit configured to acquire a video to be processed;
the mask acquisition unit is configured to input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model for each image frame in the video to be processed to obtain a target mask of the image frame;
a segmentation unit configured to perform target segmentation on the video to be processed based on a target mask of each image frame in the video to be processed;
wherein the video target segmentation model is trained by: adjusting parameters of the video target segmentation model based on an estimated mask obtained through the video target segmentation model for each image frame in a training video sample and an actual target mask of each image frame in the training video sample.
7. An apparatus for training a video object segmentation model, comprising:
a sample set acquisition unit configured to acquire a training sample set, wherein the training sample set includes a plurality of training videos and an actual target mask for each image frame in each training video;
the mask acquisition unit is configured to input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model for each image frame in the training video to obtain an estimated mask of the image frame;
a training unit configured to train the video target segmentation model based on the estimated mask for each image frame in the training video and the actual target mask for each image frame in the training video.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video object segmentation method of any one of claims 1 to 4 and/or the training method of the video object segmentation model of claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the video object segmentation method of any one of claims 1 to 4 and/or the training method of the video object segmentation model of claim 5.
10. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the video object segmentation method of any one of claims 1 to 4 and/or the training method of the video object segmentation model of claim 5.
CN202111440935.8A 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model Pending CN114140488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111440935.8A CN114140488A (en) 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111440935.8A CN114140488A (en) 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model

Publications (1)

Publication Number Publication Date
CN114140488A true CN114140488A (en) 2022-03-04

Family

ID=80389712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111440935.8A Pending CN114140488A (en) 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model

Country Status (1)

Country Link
CN (1) CN114140488A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601375A (en) * 2022-12-15 2023-01-13 深圳思谋信息科技有限公司(Cn) Video frame processing method, device, equipment and computer readable medium


Similar Documents

Publication Publication Date Title
US11200424B2 (en) Space-time memory network for locating target object in video content
US10755173B2 (en) Video deblurring using neural networks
US11017586B2 (en) 3D motion effect from a 2D image
WO2022077978A1 (en) Video processing method and video processing apparatus
CN113221983B (en) Training method and device for transfer learning model, image processing method and device
CN114565768A (en) Image segmentation method and device
CN114140488A (en) Video target segmentation method and device and training method of video target segmentation model
CN108734718B (en) Processing method, device, storage medium and equipment for image segmentation
CN111914850B (en) Picture feature extraction method, device, server and medium
CN115018734B (en) Video restoration method and training method and device of video restoration model
CN108520259B (en) Foreground target extraction method, device, equipment and storage medium
CN113177483B (en) Video object segmentation method, device, equipment and storage medium
CN113674230B (en) Method and device for detecting key points of indoor backlight face
US20230186608A1 (en) Method, device, and computer program product for video processing
CN112907645A (en) Disparity map acquisition method, disparity map acquisition device, disparity map training method, electronic device, and medium
CN111815638A (en) Training method of video segmentation network model, video segmentation method and related equipment
KR102632640B1 (en) Method and apparatus for pixel-wise matching original contents with target contents
CN112329925B (en) Model generation method, feature extraction method, device and electronic equipment
US11935214B2 (en) Video content removal using flow-guided adaptive learning
CN114125462B (en) Video processing method and device
CN113762393B (en) Model training method, gaze point detection method, medium, device and computing equipment
CN115424184A (en) Video object segmentation method and device and electronic equipment
US20240177466A1 (en) Method performed by electronic apparatus, electronic apparatus and storage medium
Guo et al. A study on the optimization simulation of big data video image keyframes in motion models
CN113076828A (en) Video editing method and device and model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination