CN114140488A - Video target segmentation method and device and training method of video target segmentation model - Google Patents

Video target segmentation method and device and training method of video target segmentation model

Info

Publication number
CN114140488A
Authority
CN
China
Prior art keywords
image frame
video
target
mask
training
Prior art date
Legal status
Pending
Application number
CN202111440935.8A
Other languages
Chinese (zh)
Inventor
王伟农
戴宇荣
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111440935.8A priority Critical patent/CN114140488A/en
Publication of CN114140488A publication Critical patent/CN114140488A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The disclosure relates to a video target segmentation method and device and a training method of a video target segmentation model. The video object segmentation method comprises the following steps: acquiring a video to be processed; for each image frame in the video to be processed, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame; and performing target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed. The video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.

Description

Video target segmentation method and device and training method of video target segmentation model
Technical Field
The present disclosure relates to the field of video processing, and in particular, to a method and an apparatus for segmenting a video target, and a method for training a video target segmentation model.
Background
Video target segmentation technology has a wide range of uses, broad application fields, and huge market potential; it can be applied to fields such as intelligent short-video editing, special-effect production, and short-video creation. At present, in order to obtain a relatively stable segmentation result using temporal information, deep-learning-based video object segmentation techniques capture temporal context through means such as optical flow or feature correlation. Although relatively good segmentation results are obtained this way, a huge amount of computation is introduced, which reduces the speed of the algorithm used by the model and creates a speed bottleneck for practical deployment, especially for mobile-terminal applications.
Disclosure of Invention
The disclosure provides a video target segmentation method and device and a training method of a video target segmentation model, which at least solve the problems of large computation amount and low convergence rate of deep-learning-based video target segmentation in the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided a video object segmentation method, including: acquiring a video to be processed; for each image frame in the video to be processed, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame; and performing target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed; wherein the video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.
Optionally, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain a target mask of the image frame includes: determining a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame.
Optionally, determining the guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame includes: for an image frame in the at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, taking the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjusting the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determining a second position whose value is less than or equal to a second threshold in the target mask of the image frame, taking the position corresponding to the second position in the initial matrix as a position of guide background information, and adjusting the value at the position of the guide background information to a second preset value; determining a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, taking the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjusting the value at the position of the invalid guide information to a third preset value; and taking the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
Optionally, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: normalizing the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenating the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: inputting the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image object segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenating the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
According to a second aspect of the embodiments of the present disclosure, there is provided a training method for a video object segmentation model, including: acquiring a training sample set, wherein the training sample set comprises a plurality of training videos and an actual target mask of each image frame in each training video; for each image frame in the training video, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain an estimated mask of the image frame; and training the video target segmentation model based on the estimated mask of each image frame in the training video and the actual target mask of each image frame in the training video.
Optionally, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain an estimated mask of the image frame includes: determining a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the estimated mask of the image frame.
Optionally, determining the guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame includes: for an image frame in the at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, taking the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjusting the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determining a second position whose value is less than or equal to a second threshold in the target mask of the image frame, taking the position corresponding to the second position in the initial matrix as a position of guide background information, and adjusting the value at the position of the guide background information to a second preset value; determining a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, taking the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjusting the value at the position of the invalid guide information to a third preset value; and taking the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
Optionally, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: normalizing the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenating the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Optionally, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: inputting the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image object segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenating the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Optionally, before the training, the parameters of the first three channels of the first convolutional layer of the video object segmentation model are set to the parameters of the first convolutional layer of an image object segmentation model, wherein the image object segmentation model is trained in advance.
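A hedged PyTorch sketch of this initialization; it assumes both first layers are nn.Conv2d modules with matching output channels and kernel size, and that the extra guide channels are zero-initialized (the disclosure only specifies copying the first three channels):

```python
import torch
import torch.nn as nn

def init_first_conv(video_conv: nn.Conv2d, image_conv: nn.Conv2d) -> None:
    """Copy the pre-trained image model's first-layer parameters into the
    first three input channels of the video model's first conv layer."""
    with torch.no_grad():
        video_conv.weight.zero_()                     # extra guide channels start at zero
        video_conv.weight[:, :3] = image_conv.weight  # RGB channels from the image model
        if video_conv.bias is not None and image_conv.bias is not None:
            video_conv.bias.copy_(image_conv.bias)
```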
According to a third aspect of the embodiments of the present disclosure, there is provided a video object segmentation apparatus, including: a video acquisition unit configured to acquire a video to be processed; a mask acquisition unit configured to, for each image frame in the video to be processed, input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame; and a segmentation unit configured to perform target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed; wherein the video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.
Optionally, the mask acquisition unit is further configured to determine a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and to input the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to, for an image frame in the at least one adjacent image frame, acquire a target mask of the image frame; determine a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, take the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjust the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determine a second position whose value is less than or equal to a second threshold in the target mask of the image frame, take the position corresponding to the second position in the initial matrix as a position of guide background information, and adjust the value at the position of the guide background information to a second preset value; determine a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, take the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjust the value at the position of the invalid guide information to a third preset value; and take the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, the mask acquisition unit is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to a first threshold in the target mask after the erosion operation.
Optionally, the mask acquisition unit is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position whose value is less than or equal to a second threshold in the target mask after the dilation operation.
Optionally, the mask acquisition unit is further configured to normalize the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenate the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to input the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenate the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a training apparatus for a video object segmentation model, including: a sample set acquisition unit configured to acquire a training sample set, wherein the training sample set includes a plurality of training videos and an actual target mask for each image frame in each training video; the mask acquisition unit is configured to input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model for each image frame in the training video to obtain an estimated mask of the image frame; a training unit configured to train the video object segmentation model based on the estimated mask for each image frame in the training video and the actual object mask for each image frame in the training video.
Optionally, the mask acquisition unit is further configured to determine a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and to input the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to, for an image frame in the at least one adjacent image frame, acquire a target mask of the image frame; determine a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, take the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjust the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determine a second position whose value is less than or equal to a second threshold in the target mask of the image frame, take the position corresponding to the second position in the initial matrix as a position of guide background information, and adjust the value at the position of the guide background information to a second preset value; determine a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, take the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjust the value at the position of the invalid guide information to a third preset value; and take the adjusted initial matrix as the guide encoding matrix of the image frame.
Optionally, the mask acquisition unit is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to a first threshold in the target mask after the erosion operation.
Optionally, the mask acquisition unit is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position whose value is less than or equal to a second threshold in the target mask after the dilation operation.
Optionally, the mask acquisition unit is further configured to normalize the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenate the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
Optionally, the mask acquisition unit is further configured to input the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenate the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
Optionally, before the training, the parameters of the first three channels of the first convolutional layer of the video object segmentation model are set to the parameters of the first convolutional layer of an image object segmentation model, wherein the image object segmentation model is trained in advance.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video object segmentation method and/or the training method of the video object segmentation model according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the video object segmentation method and/or the training method of the video object segmentation model described above according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a video object segmentation method and/or a training method of a video object segmentation model according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the video target segmentation method and device and the training method of the video target segmentation model of the present disclosure, when the target mask of an image frame is obtained, the guide encoding matrices of the image frames adjacent to it in the video are introduced; that is, temporal information is effectively fused into the target segmentation. As a result, the stability of the video target segmentation result can be significantly improved and flicker reduced while adding essentially no computation or time consumption, yielding a more precise and stable segmentation result. The present disclosure therefore solves the problems of large computation amount and low convergence speed of deep-learning-based video target segmentation in the related art.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation scenario of a video object segmentation method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a method of video object segmentation in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of training a video object segmentation model in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating a video object segmentation apparatus in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a training apparatus for a video object segmentation model in accordance with an exemplary embodiment;
fig. 6 is a block diagram of an electronic device 600 according to an embodiment of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "includes at least one of A and B" covers the following three parallel cases: (1) includes A; (2) includes B; (3) includes A and B. For another example, "perform at least one of step one and step two" covers the following three parallel cases: (1) perform step one; (2) perform step two; (3) perform step one and step two.
In order to solve the above problems, the present disclosure provides a training method for a video target segmentation model and a video target segmentation method, which can solve the problems of large computation amount and low convergence rate in the related art, and the following description takes segmenting a face in a video as an example.
Fig. 1 is a schematic diagram illustrating an implementation scenario of a video object segmentation method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the implementation scenario includes a server 100, a user terminal 110, and a user terminal 120. The number of user terminals is not limited to two, and they include, but are not limited to, mobile phones, personal computers, and the like; a user terminal may be equipped with a camera for capturing video. The server may be a single server, a server cluster composed of several servers, a cloud computing platform, or a virtualization center.
After receiving a request for training the video object segmentation model sent by the user terminals 110 and 120, the server 100 collects videos historically received from the user terminals 110 and 120 and labels the faces in the collected videos to obtain target masks, with faces as the target, for each video; the labeled videos are combined together to serve as a training sample set, where the training sample set includes a plurality of training videos and the actual target mask of each image frame in each training video. After the server 100 obtains the training sample set, for each image frame in each training video, it inputs the target mask of at least one adjacent image frame adjacent to the image frame and the pixel matrix of the image frame into the video target segmentation model to obtain an estimated mask of the image frame, determines a target loss function based on the estimated mask of each image frame in the training video and the actual target mask of each image frame in the training video, adjusts the parameters of the video target segmentation model through the target loss function, and thereby trains the video target segmentation model. After training is complete, the target mask of any input video to be processed can be obtained through the trained video target segmentation model, so that the video to be processed is accurately segmented.
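As a hedged illustration of one such parameter update, the sketch below assumes a PyTorch model taking the current frame plus adjacent-frame guide encodings, and uses binary cross-entropy as the target loss function; the disclosure does not fix a specific loss, so all names here are illustrative:

```python
import torch.nn.functional as F

def train_step(model, optimizer, frame, guides, gt_mask):
    """One parameter update: predict the estimated mask of the frame from its
    pixel matrix and the adjacent-frame guide encodings, then compare it with
    the actual target mask via the target loss function."""
    optimizer.zero_grad()
    pred = model(frame, guides)          # estimated mask (logits) of the frame
    loss = F.binary_cross_entropy_with_logits(pred, gt_mask)
    loss.backward()                      # gradients of the target loss
    optimizer.step()                     # adjust the segmentation model parameters
    return loss.item()
```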
Hereinafter, a video object segmentation method and apparatus, a training method of a video object segmentation model according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 2 to 5.
Fig. 2 is a flowchart illustrating a video object segmentation method according to an exemplary embodiment, as shown in fig. 2, the video object segmentation method includes the following steps:
in step S201, a video to be processed is acquired. The video to be processed can be any video needing target segmentation.
In step S202, for each image frame in the video to be processed, a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame are input to the video target segmentation model to obtain a target mask of the image frame. For example, take one image frame (hereinafter referred to as the current image frame) as an example, and assume the current image frame has a pixel length of H and a pixel width of W. The target masks of the N image frames adjacent to the current image frame have been obtained, denoted M_k ∈ R^(H×W), k = 1, 2, …, N. In practice, a good segmentation result can be obtained when N is 1 or 2; of course, a different number of frames can be tried according to actual needs. An adjacent frame is an image frame before or after the current image frame in the video, and can be selected differently according to the application, for example the previous N frames of the current image frame, the next N frames of the current image frame, or several frames before and after the current image frame, together forming the N frames. Here, M_k(i, j) denotes the value in the i-th row and j-th column of the target mask of the k-th adjacent frame, with values in the range [0, 1]. Assuming the current image frame is in RGB form (BGR storage, etc.), the pixel matrix of the RGB image may be represented as X ∈ R^(H×W×3); in general, each element of X takes a value in the range [0, 255].
According to an exemplary embodiment of the present disclosure, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame to the video target segmentation model to obtain the target mask of the image frame includes: determining a guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guide encoding matrix of the at least one adjacent image frame includes at least one of: guide foreground information, guide background information, and invalid guide information, wherein the guide foreground information is used for forward guiding the target mask of the image frame output by the video target segmentation model, the guide background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guide information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame. With this embodiment, the guide encoding matrix can be set flexibly according to different guidance objects. For example, the guide encoding matrix described above may be determined based on the target masks of the N frames adjacent to the current image frame, i.e., N groups of guide encoding matrices are obtained from M_k, k = 1, 2, …, N. After encoding the target masks of the N adjacent frames, the guide encoding matrices are used to guide the generation of the target mask of the current image frame, serving as reference information for computing it, which improves the temporal stability of the segmentation result and reduces flickering.
Specifically, depending on the guidance object, the guide encoding matrix may include at least one of the following kinds of information:
1) guide foreground information: through specific encoding, the target mask of the target object to be segmented is provided temporally to the current image frame, guiding the video target segmentation model forward to generate the target mask of the current image frame;
2) guide background information: through specific encoding, the target mask regions serving as background are provided temporally to the current image frame, guiding the video target segmentation model in reverse to generate the target mask of the current image frame;
3) invalid guide information: this information produces no guiding effect on the video target segmentation model; the video target segmentation model is then expected to play the role of an image target segmentation model. That is, it can be applied in the following scenarios: when there is no target mask for the adjacent frames, the video target segmentation model should still function properly, or the video target segmentation model should be able to segment both still images and video. It should be noted that, for the invalid guide information, a probability p may further be set in the video target segmentation model (the experimental default is 0.3, adjustable according to the specific situation), representing that when the image frames of the N frames adjacent to the current image frame are encoded, there is a probability p (e.g., 30%) that the guide encoding matrix G_k is entirely encoded as invalid guidance, i.e., the guide encoding matrix contains only invalid guide information, thereby improving the robustness of the model.
The three kinds of information are independent of one another and can be combined according to specific requirements; common combinations are: guide foreground + invalid guidance, and guide foreground + guide background.
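As a hedged illustration of the probability-p mechanism above, the following sketch (assuming numpy and the single-channel encoding described below, with invalid guidance encoded as the value b) replaces a guide encoding matrix entirely with invalid guidance with probability p:

```python
import numpy as np

def maybe_drop_guidance(guide: np.ndarray, p: float = 0.3, b: float = 0.0) -> np.ndarray:
    """With probability p, replace the whole guide encoding matrix with
    invalid guidance (all elements set to b), so the model also learns
    to work without temporal hints."""
    if np.random.rand() < p:
        return np.full_like(guide, b)
    return guide
```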
According to an exemplary embodiment of the present disclosure, determining the guide encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame includes: for an image frame in the at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame, taking the position corresponding to the first position in an initial matrix as a position of guide foreground information, and adjusting the value at the position of the guide foreground information to a first preset value, wherein the initial matrix is a matrix with the same number of rows and columns as the target mask of the image frame; determining a second position whose value is less than or equal to a second threshold in the target mask of the image frame, taking the position corresponding to the second position in the initial matrix as a position of guide background information, and adjusting the value at the position of the guide background information to a second preset value; determining a third position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, taking the position corresponding to the third position in the initial matrix as a position of invalid guide information, and adjusting the value at the position of the invalid guide information to a third preset value; and taking the adjusted initial matrix as the guide encoding matrix of the image frame. With this embodiment, at least one adjacent image frame can be conveniently and quickly converted into a guide encoding matrix to guide the target segmentation of the current image frame.
For example, for each image frame of the at least one adjacent image frame, the value at a position of guide foreground information in the guide encoding matrix of the image frame is b+σ, the value at a position of guide background information is b−σ, and the value at a position of invalid guide information is b, where a position of guide foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, a position of guide background information corresponds to a position whose value is less than or equal to the second threshold in the target mask of the image frame, a position of invalid guide information corresponds to a position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, and b and σ are preset values. The guide encoding matrix determined in this way allows the required information to be obtained conveniently and quickly.
Specifically, denote by G_k ∈ R^(H×W), k = 1, 2, …, N the guide encoding matrix of the k-th adjacent frame for the current image frame X. In the first encoding mode, each element of G_k takes a value in the set {b−σ, b, b+σ}: the value b+σ encodes guide foreground information, b−σ encodes guide background information, and b encodes invalid guide information. Specifically, the positions predicted as foreground in M_k (i.e., the positions whose value is greater than or equal to the first threshold in the target mask of the image frame) can be extracted and the same positions in G_k set to b+σ; the positions predicted as background in M_k (i.e., the positions whose value is less than or equal to the second threshold) can be extracted and the same positions in G_k set to b−σ; and where neither foreground nor background guidance is needed, the positions in M_k whose value is greater than the second threshold and less than the first threshold can be extracted and the same positions in G_k set to b. b may be 0 and σ may be 1.
For another example, for each image frame of the at least one adjacent image frame, the value at a position of guide foreground information in the guide encoding matrix of the image frame is [0, 1], the value at a position of guide background information is [1, 0], and the value at a position of invalid guide information is [0, 0], where a position of guide foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, a position of guide background information corresponds to a position whose value is less than or equal to the second threshold in the target mask of the image frame, and a position of invalid guide information corresponds to a position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame. The guide encoding matrix determined in this way allows the required information to be obtained conveniently and quickly.
Specifically, denote by G_k ∈ R^(H×W×2), k = 1, 2, …, N the guide encoding matrix of the k-th adjacent frame for the current image frame X. In the second encoding mode, each element takes a value in the set {b, b+σ}, and each position (i, j) holds two elements: if the two elements take the value [0, 0], the position encodes invalid guide information; if [0, 1], it encodes guide foreground information; if [1, 0], it encodes guide background information. Specifically, the positions predicted as foreground in M_k (i.e., the positions whose value is greater than or equal to the first threshold in the target mask of the image frame) can be extracted and the same positions in G_k set to [0, 1]; the positions predicted as background in M_k (i.e., the positions whose value is less than or equal to the second threshold) can be extracted and the same positions in G_k set to [1, 0]; and where neither foreground nor background guidance is needed, the positions in M_k whose value is greater than the second threshold and less than the first threshold can be extracted and the same positions in G_k set to [0, 0]. b may be 0 and σ may be 1.
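A corresponding sketch of the second (two-channel) encoding mode, under the same illustrative assumptions:

```python
import numpy as np

def encode_guide_v2(mask: np.ndarray, t_fg: float = 0.7, t_bg: float = 0.3) -> np.ndarray:
    """Second encoding mode: two channels per position, where [0, 1] marks guide
    foreground, [1, 0] guide background, and [0, 0] invalid guidance.

    mask: target mask M_k of an adjacent frame, shape (H, W), values in [0, 1].
    Returns an array of shape (H, W, 2).
    """
    h, w = mask.shape
    guide = np.zeros((h, w, 2), dtype=np.float32)  # invalid guidance everywhere
    guide[mask >= t_fg, 1] = 1.0                   # foreground channel
    guide[mask <= t_bg, 0] = 1.0                   # background channel
    return guide
```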
According to an exemplary embodiment of the present disclosure, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation. With this embodiment, a better guide encoding matrix can be obtained.
According to an exemplary embodiment of the present disclosure, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation. With this embodiment, a better guide encoding matrix can be obtained.
For example, if the guide encoding matrix includes guide foreground information, guide background information, and invalid guide information, one feasible way is to perform an image erosion operation on M_k and set the positions predicted as foreground in the eroded M_k to b+σ; perform an image dilation operation on M_k and set the positions predicted as background in the dilated M_k to b−σ; and set the remaining regions of M_k to b.
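A possible realization of this erosion/dilation variant, sketched with OpenCV; the kernel size and the thresholds are assumptions:

```python
import cv2
import numpy as np

def encode_guide_morph(mask: np.ndarray, t_fg: float = 0.7, t_bg: float = 0.3,
                       b: float = 0.0, sigma: float = 1.0, ksize: int = 5) -> np.ndarray:
    """Erode before extracting foreground and dilate before extracting background,
    so uncertain boundary pixels fall back to invalid guidance."""
    kernel = np.ones((ksize, ksize), np.uint8)
    eroded = cv2.erode(mask.astype(np.float32), kernel)    # shrinks the foreground region
    dilated = cv2.dilate(mask.astype(np.float32), kernel)  # grows the foreground region
    guide = np.full(mask.shape, b, dtype=np.float32)
    guide[eroded >= t_fg] = b + sigma                      # confident foreground only
    guide[dilated <= t_bg] = b - sigma                     # confident background only
    return guide
```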
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: normalizing the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; concatenating the processed pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame. This embodiment enables convenient and fast fusion.
Specifically, after the guide encoding matrices G_k, k = 1, 2, …, N of the N adjacent frames are obtained, the RGB pixel matrix X of the current image frame and G_k, k = 1, 2, …, N may be fused. There are various fusion modes, and the disclosure does not limit them. For example, a simple and feasible way is to normalize X in the preprocessing performed before input to the video object segmentation model and directly concatenate it with G_k, k = 1, 2, …, N in the channel dimension, so that the final number of input channels of the video object segmentation model increases from 3 to 3+N in the first encoding mode and from 3 to 3+2N in the second encoding mode. The normalization may map the pixel values of X to [b−σ, b+σ] (generally b = 0 and σ = 1 are taken, i.e., the pixel values of X are normalized to [−1, 1]).
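A hedged sketch of this channel-dimension fusion, assuming numpy and an 8-bit RGB frame:

```python
import numpy as np

def fuse_inputs(frame_rgb: np.ndarray, guides: list, b: float = 0.0,
                sigma: float = 1.0) -> np.ndarray:
    """Normalize the RGB frame to [b - sigma, b + sigma] and concatenate the
    guide encoding matrices of the N adjacent frames in the channel dimension.

    frame_rgb: (H, W, 3) uint8 pixel matrix X.
    guides: N arrays of shape (H, W) (first mode) or (H, W, 2) (second mode).
    Returns an (H, W, 3 + N) or (H, W, 3 + 2N) float32 input tensor.
    """
    x = frame_rgb.astype(np.float32) / 255.0  # to [0, 1]
    x = x * (2.0 * sigma) + (b - sigma)       # to [b - sigma, b + sigma]
    guides = [g[..., None] if g.ndim == 2 else g for g in guides]
    return np.concatenate([x] + guides, axis=-1)
```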
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guide encoding matrix of the at least one adjacent image frame into the video object segmentation model to obtain the object mask of the image frame includes: inputting the guide encoding matrix of the at least one adjacent image frame into a predetermined number of convolutional layers to obtain a processed guide encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image object segmentation model to obtain a preset intermediate output result, wherein the preset intermediate output result has the same resolution as the processed guide encoding matrix; concatenating the preset intermediate output result and the processed guide encoding matrix in the channel dimension to obtain a fused matrix; and inputting the fused matrix into the video object segmentation model to obtain the object mask of the image frame.
Specifically, G_k, k = 1, 2, …, N may first be fed into several convolutional layers to obtain an output result; the pixel matrix X is input to the image object segmentation network (i.e., a model whose input is a 3-channel image) to obtain an intermediate output result; the output result and the intermediate output result are then concatenated in the channel dimension and fed into the subsequent layers of the video object segmentation model.
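A hedged PyTorch sketch of this variant; splitting the backbone into a stem and a tail, the layer widths, and the downsampling factor of the guide branch (chosen so the two branches reach the same resolution) are all assumptions:

```python
import torch
import torch.nn as nn

class GuidedSegmentationNet(nn.Module):
    """Runs the guide encodings through a few conv layers, runs the image
    stem on the RGB frame, and fuses the two feature maps at equal resolution."""

    def __init__(self, image_stem: nn.Module, tail: nn.Module,
                 guide_channels: int, stem_channels: int):
        super().__init__()
        self.image_stem = image_stem              # front part of an image segmentation model
        self.guide_conv = nn.Sequential(          # brings guides to the stem's resolution
            nn.Conv2d(guide_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(stem_channels + 16, stem_channels, 1)
        self.tail = tail                          # remaining segmentation layers

    def forward(self, frame: torch.Tensor, guides: torch.Tensor) -> torch.Tensor:
        feats = self.image_stem(frame)            # preset intermediate output result
        g = self.guide_conv(guides)               # processed guide encoding matrix
        fused = torch.cat([feats, g], dim=1)      # channel-dimension concatenation
        return self.tail(self.fuse(fused))        # target mask of the image frame
```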
In step S203, target segmentation is performed on the video to be processed based on the target mask of each image frame in the video to be processed. The video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained by the video target segmentation model for each image frame in a training video sample and the actual target mask of each image frame in the training video sample.
Fig. 3 is a flowchart illustrating a training method of a video object segmentation model according to an exemplary embodiment, where as shown in fig. 3, the training method of the video object segmentation model includes the following steps:
In step S301, a training sample set is obtained, where the training sample set includes a plurality of training videos and the actual target mask of each image frame in each training video. For example, videos historically received from the user terminals 110 and 120 may be collected as training videos, the predetermined objects in the collected training videos may be labeled to obtain target masks targeting the predetermined objects in each training video, and the labeled videos may be combined together to serve as the training sample set, where a predetermined object is the target object to be segmented according to actual needs. For another example, since labeled video is a scarce resource and annotating video is time-consuming and labor-intensive, a labeled still picture can be used to simulate a two- or three-frame video: the still picture and its labeled target mask are subjected to the same random augmentations through data enhancement methods such as translation, distortion, rotation, radial transformation, thin-plate spline interpolation, and the like, so as to simulate motion, jitter, blur, and similar behavior of objects in video, thereby obtaining the training videos required for training and the corresponding labeled target masks.
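Below is a minimal sketch of simulating a short clip from one labeled still image, using random rotation and translation as stand-ins for the augmentations listed above; OpenCV and all parameter ranges are assumptions:

```python
import cv2
import numpy as np

def simulate_clip(image: np.ndarray, mask: np.ndarray, n_frames: int = 3,
                  max_shift: float = 0.03, max_angle: float = 5.0):
    """Apply the same random warp to a still image and its labeled target mask
    to mimic object motion and jitter across a few video frames."""
    h, w = mask.shape
    frames, masks = [image], [mask]
    for _ in range(n_frames - 1):
        angle = np.random.uniform(-max_angle, max_angle)
        tx = np.random.uniform(-max_shift, max_shift) * w
        ty = np.random.uniform(-max_shift, max_shift) * h
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        m[:, 2] += np.array([tx, ty])          # add a small random translation
        frames.append(cv2.warpAffine(image, m, (w, h)))
        masks.append(cv2.warpAffine(mask, m, (w, h)))
    return frames, masks
```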
In step S302, for each image frame in the training video, a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame are input to the video target segmentation model, so as to obtain an estimated mask of the image frame. For example, taking one image frame (hereinafter referred to as the current image frame) as an example, assume that the current image frame has a pixel height of H and a pixel width of W, and that a target mask has already been obtained for each of the N image frames adjacent to the current image frame, denoted M_k ∈ R^(H×W), k = 1, …, N. When N is 1 or 2, a good segmentation result can already be obtained in practical applications; of course, different numbers of frames can be tried according to actual needs. An adjacent frame refers to an image frame before or after the current image frame in the training video, and different selections can be made according to the actual application: for example, the previous N frames of the current image frame, the next N frames, or several frames before and after the current image frame may be selected to form the N frames. Here, M_k(i, j) denotes the value of the i-th row and j-th column of the target mask of the k-th adjacent image frame, with values in the range [0, 1]. Assuming the current image frame is in RGB form (or BGR storage, etc.), its pixel matrix may be represented as X ∈ R^(H×W×3); in general, each element of X takes a value in the range [0, 255].
According to an exemplary embodiment of the present disclosure, inputting a target mask of at least one adjacent image frame adjacent to an image frame and a pixel matrix of the image frame into a video target segmentation model to obtain an estimated mask of the image frame includes: determining a guiding encoding matrix of the at least one adjacent image frame based on the target mask of the at least one adjacent image frame, wherein each guiding encoding matrix of the at least one adjacent image frame comprises at least one of: guiding foreground information, guiding background information, and invalid guiding information, where the guiding foreground information forward-guides the target mask of the image frame output by the video target segmentation model, the guiding background information backward-guides the target mask of the image frame output by the video target segmentation model, and the invalid guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guiding encoding matrix of the at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame. With this embodiment, the guiding encoding matrix can be flexibly set according to different guidance purposes. For example, the guiding encoding matrices may be determined based on the target masks of the N image frames adjacent to the current image frame, i.e., N groups of guiding encoding matrices are obtained from M_k, k = 1, …, N. The guiding encoding matrices encode the target masks of the N adjacent image frames and then guide the generation of the target mask of the current image frame, serving as reference information for calculating the target mask of the current image frame; this improves the temporal stability of the segmentation results and reduces flickering and the like.
Specifically, the guidance coding matrix may include at least one of the following information according to different guidance objects:
1) Guiding foreground information: through specific encoding, the temporal target mask of the target object to be segmented is provided to the current image frame, forward-guiding the video target segmentation model to generate the target mask of the current image frame;
2) Guiding background information: through specific encoding, the temporal target mask of regions serving as background is provided to the current image frame, backward-guiding the video target segmentation model to generate the target mask of the current image frame;
3) Invalid guiding information: this information produces no guiding effect on the video target segmentation model; instead, the video target segmentation model is expected to behave like an image target segmentation model. This applies in the following scenarios: when no target mask is available for the adjacent frames, the video target segmentation model should still function properly, or the model should be able to segment both still images and video. It should be noted that, for invalid guiding information, a probability p may be set when training the video target segmentation model (the experimental default is 0.3, adjustable according to the specific situation): with probability p, when the adjacent N image frames are encoded, the guiding encoding matrices E_k, k = 1, …, N, are entirely encoded as invalid guiding encoding, i.e., the guiding encoding matrices contain only invalid guiding information, thereby improving the robustness of the model (see the sketch after this list).
The three kinds of information are independent and can be combined in different ways according to specific requirements; common combinations include: guiding foreground + invalid guidance, and guiding foreground + guiding background.
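For the probability-p invalid guidance described in item 3) above, a minimal sketch follows; the function and variable names are hypothetical, and it assumes the guiding encoding matrices have already been built, with b encoding invalid guidance:

```python
# With probability p, replace all guiding encodings with invalid guidance
# so the video model also learns to behave like an image model.
import numpy as np

def maybe_drop_guidance(guides, p=0.3, b=0.0):
    """guides: list of (H, W) guiding encoding matrices for the N frames."""
    if np.random.rand() < p:
        return [np.full_like(g, b) for g in guides]  # only invalid guidance
    return guides
```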
According to an exemplary embodiment of the present disclosure, determining a guided encoding matrix of at least one adjacent image frame based on a target mask of the at least one adjacent image frame adjacent to the image frame includes: for an image frame in at least one adjacent image frame, acquiring a target mask of the image frame; determining a first position with a value larger than or equal to a first threshold value in a target mask of an image frame, taking a position corresponding to the first position in an initial matrix as a position for guiding foreground information, and adjusting the value of the position for guiding the foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame; determining a second position with a value less than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of the guiding background information, and adjusting the value of the position of the guiding background information to a second preset value; determining a third position of which the value is greater than the second threshold value and less than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position for guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value; and taking the adjusted initial matrix as a guide coding matrix of the image frame. According to the embodiment, at least one adjacent image frame can be conveniently and quickly converted into the guide coding matrix so as to guide the target segmentation of the current image frame.
For example, for each image frame of the at least one adjacent image frame, the value representing the position of guiding foreground information in the guiding encoding matrix of the image frame is b + σ, the value representing the position of guiding background information is b − σ, and the value representing the position of invalid guiding information is b, where the position of guiding foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, the position of guiding background information corresponds to a position whose value is less than or equal to the second threshold in the target mask of the image frame, the position of invalid guiding information corresponds to a position whose value is greater than the second threshold and less than the first threshold in the target mask of the image frame, and b and σ are positive integers. The guiding encoding matrix determined by this embodiment allows the required information to be acquired conveniently and quickly.
Specifically, denote by E_k ∈ R^(H×W), k = 1, …, N, the guiding encoding matrix of the k-th adjacent frame for the current image frame X. In the first encoding mode, E_k takes values in the set {b − σ, b, b + σ}, where the value b + σ encodes guiding foreground information, b − σ encodes guiding background information, and b encodes invalid guiding information. Specifically, the positions in M_k predicted as foreground (i.e., positions whose value in the target mask of the image frame is greater than or equal to the first threshold) may be extracted and the same positions in E_k set to b + σ; the positions in M_k predicted as background (i.e., positions whose value is less than or equal to the second threshold) may be extracted and the same positions in E_k set to b − σ; and if it is not necessary to guide the foreground or background, the positions in M_k whose value is greater than the second threshold and less than the first threshold may be extracted and the same positions in E_k set to b. b may be 0 and σ may be 1.
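A minimal sketch of this first encoding mode follows; the threshold values 0.3 and 0.7 are illustrative assumptions, since the patent only names a first and a second threshold:

```python
# First encoding mode: single-channel guiding encoding matrix with values
# in {b - sigma, b, b + sigma}.
import numpy as np

def encode_mask_mode1(mask, t_low=0.3, t_high=0.7, b=0.0, sigma=1.0):
    """mask: (H, W) adjacent-frame target mask with values in [0, 1]."""
    guide = np.full(mask.shape, b, dtype=np.float32)  # b: invalid guidance
    guide[mask >= t_high] = b + sigma                 # guiding foreground
    guide[mask <= t_low] = b - sigma                  # guiding background
    return guide
```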
For another example, for each image frame of the at least one adjacent image frame, the value representing the position of guiding foreground information in the guiding encoding matrix of the image frame is [0, 1], the value representing the position of guiding background information is [1, 0], and the value representing the position of invalid guiding information is [0, 0], where the position of guiding foreground information corresponds to a position whose value is greater than or equal to the first threshold in the target mask of the image frame, the position of guiding background information corresponds to a position whose value is less than or equal to the second threshold, and the position of invalid guiding information corresponds to a position whose value is greater than the second threshold and less than the first threshold. The guiding encoding matrix determined by this embodiment likewise allows the required information to be acquired conveniently and quickly.
Specifically, denote by E_k ∈ R^(H×W×2), k = 1, …, N, the guiding encoding matrix of the k-th adjacent frame for the current image frame X. In the second encoding mode, each position of E_k holds two elements taking values in {b, b + σ}. For the i-th row and j-th column, if the two elements take the value [0, 0], the position encodes invalid guiding information; if they take the value [0, 1], the position encodes guiding foreground information; and if they take the value [1, 0], the position encodes guiding background information. Specifically, the positions in M_k predicted as foreground (i.e., positions whose value is greater than or equal to the first threshold) may be extracted and the same positions in E_k set to [0, 1]; the positions in M_k predicted as background (i.e., positions whose value is less than or equal to the second threshold) may be extracted and the same positions in E_k set to [1, 0]; and if it is not necessary to guide the foreground or background, the positions in M_k whose value is greater than the second threshold and less than the first threshold may be extracted and the same positions in E_k set to [0, 0]. b may be 0 and σ may be 1.
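A minimal sketch of the second encoding mode, under the same assumed thresholds and with b = 0, σ = 1, so the element values are {0, 1}:

```python
# Second encoding mode: two-channel one-hot style guiding encoding,
# [0, 1] for foreground, [1, 0] for background, [0, 0] otherwise.
import numpy as np

def encode_mask_mode2(mask, t_low=0.3, t_high=0.7):
    """mask: (H, W) adjacent-frame target mask with values in [0, 1]."""
    guide = np.zeros(mask.shape + (2,), dtype=np.float32)  # [0, 0]: invalid
    guide[mask >= t_high, 1] = 1.0                         # [0, 1]: foreground
    guide[mask <= t_low, 0] = 1.0                          # [1, 0]: background
    return guide
```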
According to an exemplary embodiment of the present disclosure, determining the first position whose value is greater than or equal to the first threshold in the target mask of the image frame includes: performing an erosion operation on the target mask of the image frame, and determining the first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation. With this embodiment, a better guiding encoding matrix can be obtained.
According to an exemplary embodiment of the present disclosure, determining the second position whose value is less than or equal to the second threshold in the target mask of the image frame includes: performing a dilation operation on the target mask of the image frame, and determining the second position whose value is less than or equal to the second threshold in the target mask after the dilation operation. With this embodiment, a better guiding encoding matrix can be obtained.
For example, if the guiding encoding matrix includes guiding foreground information, guiding background information, and invalid guiding information, one feasible way is to perform an image erosion operation on M_k, extract the positions predicted as foreground in the eroded M_k, and set those positions to b + σ; perform an image dilation operation on M_k, extract the positions predicted as background in the dilated M_k, and set those positions to b − σ; and set the remaining region to b.
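A minimal sketch combining the erosion and dilation refinements with the first encoding mode; the kernel size and thresholds are illustrative assumptions:

```python
# Erode before taking confident foreground positions; dilate before taking
# confident background positions (morphology shrinks/grows the mask).
import numpy as np
import cv2

def encode_with_morphology(mask, b=0.0, sigma=1.0, ksize=5):
    """mask: (H, W) float32 adjacent-frame target mask in [0, 1]."""
    kernel = np.ones((ksize, ksize), np.uint8)
    eroded = cv2.erode(mask, kernel)    # min filter: shrinks the foreground
    dilated = cv2.dilate(mask, kernel)  # max filter: grows the foreground
    guide = np.full(mask.shape, b, dtype=np.float32)
    guide[eroded >= 0.7] = b + sigma    # confident foreground after erosion
    guide[dilated <= 0.3] = b - sigma   # confident background after dilation
    return guide
```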
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guiding encoding matrix of at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame includes: performing normalization processing on the pixel matrix of the image frame to obtain a processed pixel matrix of the image frame; connecting the processed pixel matrix of the image frame and the guiding encoding matrix of the at least one adjacent image frame in the channel number dimension to obtain a fused matrix; and inputting the fused matrix into the video target segmentation model to obtain the target mask of the image frame. This embodiment enables convenient and fast fusion.
Specifically, after the guiding encoding matrices E_k, k = 1, …, N, of the adjacent N frames are obtained, the RGB pixel matrix X of the current image frame may be fused with E_k, k = 1, …, N. There are various fusion methods, and the present disclosure does not limit this. For example, a simple and feasible way is to normalize X and, before inputting to the video target segmentation model, directly connect (concatenate) X with E_k, k = 1, …, N, in the channel number dimension (channel), so that the final number of input channels of the video target segmentation model increases from 3 to 3 + N in the first encoding mode, and from 3 to 3 + 2 × N in the second encoding mode. The normalization may scale the pixel values of X to [b − σ, b + σ] (in general, b = 0 and σ = 1 are taken, i.e., the pixel values of X are normalized to [−1, 1]).
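A minimal sketch of this concatenation fusion for the first encoding mode, assuming a PyTorch (C, H, W) tensor layout:

```python
# Normalize the RGB frame to [-1, 1] and concatenate the N single-channel
# guiding encodings along the channel dimension -> (3 + N, H, W) input.
import torch

def fuse_inputs(x_rgb, guides):
    """x_rgb: (3, H, W) uint8 tensor; guides: list of N (H, W) float tensors."""
    x = x_rgb.float() / 127.5 - 1.0                    # [0, 255] -> [-1, 1]
    g = torch.stack([g.float() for g in guides], dim=0)  # (N, H, W)
    return torch.cat([x, g], dim=0)                    # (3 + N, H, W)
```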
According to an exemplary embodiment of the present disclosure, inputting the pixel matrix of the image frame and the guiding encoding matrix of at least one adjacent image frame into the video target segmentation model to obtain the target mask of the image frame includes: inputting the guiding encoding matrix of the at least one adjacent image frame into a predetermined number of convolution layers to obtain a processed guiding encoding matrix of the at least one adjacent image frame; inputting the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, where the preset intermediate output result has the same resolution as the processed guiding encoding matrix; connecting the preset intermediate output result with the processed guiding encoding matrix in the channel number dimension to obtain a fused matrix; and inputting the fused matrix into the video target segmentation model to obtain the target mask of the image frame. With this embodiment, a better fusion result can be obtained.
Specifically, the guiding encoding matrices E_k, k = 1, …, N, may first be sent to several convolution layers to obtain an output result; the pixel matrix X is then input to the image target segmentation network (i.e., the model takes a 3-channel image as input) to obtain an intermediate output result; the output result and the intermediate output result are then connected (concatenated) in the channel number dimension (channel), and the connected result is sent to the subsequent part of the video target segmentation model.
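The following is a minimal, heavily assumed sketch of this convolution-based fusion; the patent does not specify layer counts, channel widths, or the tap point in the image network, so all of those are illustrative:

```python
# Pass the stacked guiding encodings through a few conv layers, downsample
# them to the resolution of an intermediate feature map of the image
# network, then concatenate and merge along the channel dimension.
import torch
import torch.nn as nn

class GuideFusion(nn.Module):
    def __init__(self, n_guides=2, feat_ch=64):
        super().__init__()
        # Two stride-2 convs bring (N, H, W) guidance down to (16, H/4, W/4).
        self.guide_conv = nn.Sequential(
            nn.Conv2d(n_guides, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.merge = nn.Conv2d(feat_ch + 16, feat_ch, 1)  # 1x1 merge conv

    def forward(self, feat, guides):
        """feat: (B, feat_ch, H/4, W/4) intermediate feature; guides: (B, N, H, W)."""
        g = self.guide_conv(guides)                 # (B, 16, H/4, W/4)
        return self.merge(torch.cat([feat, g], 1))  # fused features
```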
Returning to fig. 3, in step S303, the video object segmentation model is trained based on the estimated mask of each image frame in the training video and the actual object mask of each image frame in the training video. In step S303, a target loss function may be determined based on the estimated mask of each image frame in the training video and the labeled target mask corresponding to the training video, and then parameters of the video target segmentation model may be adjusted through target loss minimization to perform training.
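A minimal sketch of one such parameter-adjustment step, assuming binary cross-entropy as the target loss (the patent does not fix a particular loss function):

```python
# One training step: predict the mask, compute the target loss against the
# labeled mask, and adjust the model parameters by gradient descent.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, fused_input, gt_mask):
    """fused_input: (B, C, H, W); gt_mask: (B, 1, H, W) float in [0, 1]."""
    optimizer.zero_grad()
    pred = model(fused_input)  # (B, 1, H, W) mask logits
    loss = F.binary_cross_entropy_with_logits(pred, gt_mask)
    loss.backward()
    optimizer.step()           # adjust parameters to minimize the target loss
    return loss.item()
```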
According to an exemplary embodiment of the present disclosure, before training, the parameters of the first three channels of the first convolutional layer of the video target segmentation model are set to the parameters of the first convolutional layer of the image target segmentation model, where the image target segmentation model is trained in advance. With this embodiment, the parameters of the first convolutional layer of the pre-trained image target segmentation model can be directly copied into the first three channels of the first convolutional layer of the video target segmentation model before training, so that training starts from good parameters and a better training result can be obtained.
Specifically, suppose an image target segmentation model already exists (i.e., a model that does not fuse the guiding encoding matrices E_k, k = 1, …, N, so its number of input channels is still 3). In order to speed up convergence when training the video target segmentation model and to retain the excellent performance of the image target segmentation model when invalid guiding information is used, the parameters of the image target segmentation model may be used to initialize the video target segmentation model: for the first convolutional layer, the parameters of the first convolutional layer of the image target segmentation model are simply copied to the positions of the first three channels of the first convolutional layer of the video target segmentation model.
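A minimal sketch of this initialization in PyTorch, assuming both first convolutional layers share the same output channel count and kernel size:

```python
# Copy the pre-trained image model's first conv weights into the first three
# input channels of the video model's first conv; the extra guidance
# channels keep their fresh initialization.
import torch

@torch.no_grad()
def init_first_conv(video_conv, image_conv):
    """Both args are nn.Conv2d; video_conv has 3 + extra input channels."""
    video_conv.weight[:, :3] = image_conv.weight  # RGB channel weights
    if image_conv.bias is not None:
        video_conv.bias.copy_(image_conv.bias)
```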
According to an exemplary embodiment of the present disclosure, the image target segmentation model is trained by: acquiring a second training sample set, wherein the second training sample set includes a plurality of training images and labeled target masks corresponding to the plurality of training images; inputting a training image into the image target segmentation model to obtain an estimated mask of the training image; determining a target loss function based on the estimated mask and the labeled target mask corresponding to the training image; and adjusting parameters of the image target segmentation model through the target loss function to complete the training of the image target segmentation model.
In summary, to address stability and related problems in video target segmentation, the present disclosure provides a video target segmentation method based on adjacent frames: after the target masks of the N adjacent frames (N being an integer greater than 0) of the current image frame in a video are obtained and guide-encoded, they are fused with the RGB picture of the current image frame (a picture that may be stored in RGB, BGR, or a similar manner) and input into the video target segmentation model to calculate the target mask of the current image frame; target segmentation is then performed on the video based on the target mask of each image frame. In this way, the temporal information in the video is effectively fused, and, with essentially no increase in computation or time consumption, the stability of video target segmentation results is significantly improved, flickering is markedly reduced, and finer and more stable segmentation results are obtained. The method has very wide applicability and can be adapted to each subdivision direction of video processing.
Fig. 4 is a block diagram illustrating a video object segmentation apparatus in accordance with an exemplary embodiment. Referring to fig. 4, the apparatus includes a video acquisition unit 40, a mask acquisition unit 42, and a segmentation unit 44.
A video acquisition unit 40 configured to acquire a video to be processed; a mask acquisition unit 42 configured to, for each image frame in the video to be processed, input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain a target mask of the image frame; and a segmentation unit 44 configured to perform target segmentation on the video to be processed based on the target mask of each image frame in the video to be processed; wherein the video target segmentation model is obtained by training in the following way: adjusting parameters of the video target segmentation model based on the estimated mask obtained through the video target segmentation model for each image frame in the training video sample and the actual target mask of each image frame in the training video sample.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 42 is further configured to determine a leading encoding matrix of at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each leading encoding matrix of the at least one adjacent image frame comprises at least one of: the method comprises the following steps of guiding foreground information, guiding background information and invalid guiding information, wherein the guiding foreground information is used for forward guiding a target mask of an image frame output by a video target segmentation model, the guiding background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide coding matrix of at least one adjacent image frame into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 42 is further configured to acquire, for an image frame of the at least one adjacent image frame, a target mask of the image frame; determining a first position with a value larger than or equal to a first threshold value in a target mask of an image frame, taking a position corresponding to the first position in an initial matrix as a position for guiding foreground information, and adjusting the value of the position for guiding the foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame; determining a second position with a value less than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of the guiding background information, and adjusting the value of the position of the guiding background information to a second preset value; determining a third position of which the value is greater than the second threshold value and less than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position for guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value; and taking the adjusted initial matrix as a guide coding matrix of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 42 is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 42 is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position in the target mask after the dilation operation, where the value is less than or equal to a second threshold.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 42 is further configured to perform normalization processing on the pixel matrix of the image frame, so as to obtain a pixel matrix after the image frame is processed; connecting the pixel matrix after image frame processing and a guide coding matrix of at least one adjacent image frame in a channel number dimension to obtain a fused matrix; and inputting the fused matrix into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 42 is further configured to input the guiding encoding matrix of at least one adjacent image frame into a predetermined number of convolution layers to obtain a processed guiding encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, where the preset intermediate output result has the same resolution as the processed guiding encoding matrix; connect the preset intermediate output result with the processed guiding encoding matrix in the channel number dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
FIG. 5 is a block diagram illustrating a training apparatus for a video object segmentation model according to an example embodiment. Referring to fig. 5, the apparatus includes a sample set acquisition unit 50, a mask acquisition unit 52, and a training unit 54.
A sample set acquiring unit 50 configured to acquire a training sample set, wherein the training sample set includes a plurality of training videos and an actual target mask for each image frame in each training video; the mask acquiring unit 52 is configured to, for each image frame in the training video, input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into the video target segmentation model to obtain an estimated mask of the image frame; a training unit 54 configured to train the video object segmentation model based on the estimated mask for each image frame in the training video and the actual object mask for each image frame in the training video.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 52 is further configured to determine a leading encoding matrix of at least one adjacent image frame based on the target mask of the at least one adjacent image frame adjacent to the image frame, wherein each leading encoding matrix of the at least one adjacent image frame comprises at least one of: the method comprises the following steps of guiding foreground information, guiding background information and invalid guiding information, wherein the guiding foreground information is used for forward guiding a target mask of an image frame output by a video target segmentation model, the guiding background information is used for backward guiding the target mask of the image frame output by the video target segmentation model, and the invalid guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model; and inputting the pixel matrix of the image frame and the guide coding matrix of at least one adjacent image frame into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 52 is further configured to acquire, for an image frame of the at least one adjacent image frame, a target mask of the image frame; determining a first position with a value larger than or equal to a first threshold value in a target mask of an image frame, taking a position corresponding to the first position in an initial matrix as a position for guiding foreground information, and adjusting the value of the position for guiding the foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame; determining a second position with a value less than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of the guiding background information, and adjusting the value of the position of the guiding background information to a second preset value; determining a third position of which the value is greater than the second threshold value and less than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position for guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value; and taking the adjusted initial matrix as a guide coding matrix of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 52 is further configured to perform an erosion operation on the target mask of the image frame, and determine a first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 52 is further configured to perform a dilation operation on the target mask of the image frame, and determine a second position in the target mask after the dilation operation, where the value is less than or equal to a second threshold.
According to an exemplary embodiment of the present disclosure, the mask obtaining unit 52 is further configured to perform normalization processing on the pixel matrix of the image frame, so as to obtain a pixel matrix after the image frame processing; connecting the pixel matrix after image frame processing and a guide coding matrix of at least one adjacent image frame in a channel number dimension to obtain a fused matrix; and inputting the fused matrix into a video target segmentation model to obtain a target mask of the image frame.
According to an exemplary embodiment of the present disclosure, the mask acquiring unit 52 is further configured to input the guiding encoding matrix of at least one adjacent image frame into a predetermined number of convolution layers to obtain a processed guiding encoding matrix of the at least one adjacent image frame; input the pixel matrix of the image frame into an image target segmentation model to obtain a preset intermediate output result, where the preset intermediate output result has the same resolution as the processed guiding encoding matrix; connect the preset intermediate output result with the processed guiding encoding matrix in the channel number dimension to obtain a fused matrix; and input the fused matrix into the video target segmentation model to obtain the target mask of the image frame.
According to an exemplary embodiment of the present disclosure, before training, parameters of the first three channels of the first layer convolutional layer of the video object segmentation model are set as parameters of the first layer convolutional layer of the image object segmentation model, wherein the image object segmentation model is trained in advance.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 6 is a block diagram of an electronic device 600 including at least one memory 601 and at least one processor 602, the at least one memory having a set of computer-executable instructions stored therein that, when executed by the at least one processor, perform a method of training a video object segmentation model and a method of video object segmentation according to embodiments of the present disclosure.
By way of example, the electronic device 600 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device 600 need not be a single electronic device and can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 600 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 600, the processor 602 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 602 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The processor 602 may execute instructions or code stored in memory, where the memory 601 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 601 may be integrated with the processor 602, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 601 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 601 and the processor 602 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 602 can read files stored in the memory 601.
Further, the electronic device 600 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein, when executed by at least one processor, instructions in the computer-readable storage medium cause the at least one processor to perform the training method of the video target segmentation model and the video target segmentation method of the embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, a computer program product is provided, which includes computer instructions, and the computer instructions, when executed by a processor, implement a training method of a video object segmentation model and a video object segmentation method of the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for segmenting video objects, comprising:
acquiring a video to be processed;
for each image frame in the video to be processed, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain a target mask of the image frame;
performing target segmentation on the video to be processed based on a target mask of each image frame in the video to be processed;
wherein the video target segmentation model is trained by: adjusting parameters of the video target segmentation model based on an estimated mask obtained through the video target segmentation model for each image frame in a training video sample and an actual target mask of each image frame in the training video sample.
2. The video object segmentation method of claim 1, wherein the inputting the object mask of at least one adjacent image frame adjacent to the image frame and the pixel matrix of the image frame into a video object segmentation model to obtain the object mask of the image frame comprises:
determining a guided encoding matrix for at least one adjacent image frame adjacent to the image frame based on a target mask for the at least one adjacent image frame, wherein each of the guided encoding matrices for the at least one adjacent image frame comprises at least one of: the guiding foreground information is used for forward guiding the video target segmentation model to output a target mask of the image frame, the guiding background information is used for backward guiding the video target segmentation model to output the target mask of the image frame, and the ineffective guiding information has no guiding effect on the target mask of the image frame output by the video target segmentation model;
and inputting the pixel matrix of the image frame and the guide coding matrix of at least one adjacent image frame into the video target segmentation model to obtain a target mask of the image frame.
3. The video object segmentation method of claim 2 wherein said determining a guided coding matrix for at least one adjacent image frame based on an object mask for the at least one adjacent image frame adjacent to the image frame comprises:
for an image frame of the at least one adjacent image frame,
acquiring a target mask of the image frame;
determining a first position with a value larger than or equal to a first threshold value in the target mask of the image frame, taking a position corresponding to the first position in an initial matrix as a position of guiding foreground information, and adjusting the value of the position of the guiding foreground information to be a first preset value, wherein the initial matrix is a unit matrix with the same number of rows and columns as the target mask of the image frame;
determining a second position with a value smaller than or equal to a second threshold value in the target mask of the image frame, taking a position corresponding to the second position in the initial matrix as a position of guiding background information, and adjusting the value of the position of the guiding background information to a second preset value;
determining a third position with a value larger than the second threshold value and smaller than the first threshold value in the target mask of the image frame, taking a position corresponding to the third position in the initial matrix as a position of guiding invalid information, and adjusting the value of the position of the invalid guiding information to a third preset value;
and taking the adjusted initial matrix as a guide coding matrix of the image frame.
4. The method of claim 3, wherein the determining a first position whose value is greater than or equal to a first threshold in the target mask of the image frame comprises:
performing an erosion operation on the target mask of the image frame,
and determining a first position whose value is greater than or equal to the first threshold in the target mask after the erosion operation.
5. A training method of a video object segmentation model is characterized by comprising the following steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of training videos and an actual target mask of each image frame in each training video;
for each image frame in the training video, inputting a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model to obtain an estimated mask of the image frame;
and training the video target segmentation model based on the estimated mask of each image frame in the training video and the actual target mask of each image frame in the training video.
6. A video object segmentation apparatus, comprising:
a video acquisition unit configured to acquire a video to be processed;
the mask acquisition unit is configured to input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model for each image frame in the video to be processed to obtain a target mask of the image frame;
a segmentation unit configured to perform target segmentation on the video to be processed based on a target mask of each image frame in the video to be processed;
wherein the video target segmentation model is trained by: adjusting parameters of the video target segmentation model based on an estimated mask obtained through the video target segmentation model for each image frame in a training video sample and an actual target mask of each image frame in the training video sample.
7. An apparatus for training a video object segmentation model, comprising:
a sample set acquisition unit configured to acquire a training sample set, wherein the training sample set includes a plurality of training videos and an actual target mask for each image frame in each training video;
the mask acquisition unit is configured to input a target mask of at least one adjacent image frame adjacent to the image frame and a pixel matrix of the image frame into a video target segmentation model for each image frame in the training video to obtain an estimated mask of the image frame;
a training unit configured to train the video target segmentation model based on the estimated mask for each image frame in the training video and the actual target mask for each image frame in the training video.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video object segmentation method of any one of claims 1 to 4 and/or the training method of the video object segmentation model of claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the video object segmentation method of any one of claims 1 to 4 and/or the training method of the video object segmentation model of claim 5.
10. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the video object segmentation method of any one of claims 1 to 4 and/or the training method of the video object segmentation model of claim 5.
CN202111440935.8A 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model Pending CN114140488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111440935.8A CN114140488A (en) 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111440935.8A CN114140488A (en) 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model

Publications (1)

Publication Number Publication Date
CN114140488A true CN114140488A (en) 2022-03-04

Family

ID=80389712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111440935.8A Pending CN114140488A (en) 2021-11-30 2021-11-30 Video target segmentation method and device and training method of video target segmentation model

Country Status (1)

Country Link
CN (1) CN114140488A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601375A (en) * 2022-12-15 2023-01-13 深圳思谋信息科技有限公司(Cn) Video frame processing method, device, equipment and computer readable medium


Similar Documents

Publication Publication Date Title
US11200424B2 (en) Space-time memory network for locating target object in video content
US10755173B2 (en) Video deblurring using neural networks
US11017586B2 (en) 3D motion effect from a 2D image
WO2022077978A1 (en) Video processing method and video processing apparatus
CN113221983B (en) Training method and device for transfer learning model, image processing method and device
CN114565768A (en) Image segmentation method and device
CN114140488A (en) Video target segmentation method and device and training method of video target segmentation model
CN108734718B (en) Processing method, device, storage medium and equipment for image segmentation
CN111914850B (en) Picture feature extraction method, device, server and medium
CN115018734B (en) Video restoration method and training method and device of video restoration model
CN108520259B (en) Foreground target extraction method, device, equipment and storage medium
CN113177483B (en) Video object segmentation method, device, equipment and storage medium
CN113674230B (en) Method and device for detecting key points of indoor backlight face
US20230186608A1 (en) Method, device, and computer program product for video processing
CN112907645A (en) Disparity map acquisition method, disparity map acquisition device, disparity map training method, electronic device, and medium
CN111815638A (en) Training method of video segmentation network model, video segmentation method and related equipment
KR102632640B1 (en) Method and apparatus for pixel-wise matching original contents with target contents
CN112329925B (en) Model generation method, feature extraction method, device and electronic equipment
US11935214B2 (en) Video content removal using flow-guided adaptive learning
CN114125462B (en) Video processing method and device
CN113762393B (en) Model training method, gaze point detection method, medium, device and computing equipment
CN115424184A (en) Video object segmentation method and device and electronic equipment
US20240177466A1 (en) Method performed by electronic apparatus, electronic apparatus and storage medium
Guo et al. A study on the optimization simulation of big data video image keyframes in motion models
CN113076828A (en) Video editing method and device and model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination