WO2022142702A1 - Video image processing method and apparatus - Google Patents

Video image processing method and apparatus

Info

Publication number
WO2022142702A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
feature
video
training
Prior art date
Application number
PCT/CN2021/127942
Other languages
French (fr)
Chinese (zh)
Inventor
曹炎培
赵培尧
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2022142702A1 publication Critical patent/WO2022142702A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/001 Model-based coding, e.g. wire frame
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Definitions

  • the present disclosure relates to the technical field of computer processing, and in particular, to a video image processing method, apparatus, electronic device, and storage medium.
  • Human pose estimation and human body 3D model reconstruction in video images aim to recover the positions of the human body joints and the 3D model of the body surface in each video frame.
  • This technology is widely used in security, health monitoring, computer animation, virtual reality, and augmented reality scenarios.
  • In related art, optical flow or a recurrent neural network is usually used to extract temporal information to reconstruct a dynamic 3D model of the human body.
  • A time series convolutional network extracts the human body features in the input video images, and the extracted human body features are then used to regress the human body pose or 3D model.
  • the present disclosure provides a video image processing method, apparatus, electronic device, computer-readable storage medium, and computer program product.
  • the technical solutions of the present disclosure are as follows:
  • a video image processing method including:
  • the value of i is updated to i+1, and the above steps, from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network, through generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image, are repeated until i = N, where N is the total number of frames in the target video.
  • a video image processing apparatus including:
  • the first processing module is configured to input the first frame image in the target video into the 3D reconstruction network and the video frame encoding network respectively, to obtain the 3D reconstruction result of the target object in the first frame image output by the 3D reconstruction network and the first image feature of the first frame image output by the video frame encoding network, where the first image feature is an image feature for the target object;
  • the second processing module is configured to input the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network to obtain the time series feature of the i-th frame image, where the initial value of i is 1 and i is a positive integer;
  • a third processing module configured to input the (i+1)-th frame image in the target video into the video frame encoding network to obtain the first image feature of the (i+1)-th frame image;
  • a three-dimensional reconstruction module configured to generate a three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image;
  • an electronic device comprising:
  • a memory for storing the processor-executable instructions
  • the processor is configured to execute the instructions to implement the video image processing method described in the first aspect.
  • a computer-readable storage medium: when the instructions in the computer-readable storage medium are executed by an electronic device, the electronic device is enabled to execute the video image processing method described in the first aspect.
  • a computer program product including a computer program, which, when the computer program is executed by a processor, implements the video image processing method described in the first aspect.
  • the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is the image feature for the target object; the first image feature of the i-th frame image in the target video and the 3D reconstruction result of the target object in the i-th frame image are input into the time series feature extraction network to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1;
  • the (i+1)-th frame image is input into the video frame encoding network to obtain the first image feature of the (i+1)-th frame image; based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image, the three-dimensional reconstruction result of the target object in the (i+1)-th frame image is generated; the value of i is updated to i+1, and the above steps are repeated.
  • the solution has the advantages of a small amount of calculation, a high processing speed, and high efficiency.
  • Fig. 1 is a flowchart of a video image processing method according to an exemplary embodiment.
  • Fig. 2 is a flowchart showing a three-dimensional reconstruction of a human body in a video image according to an exemplary embodiment.
  • Fig. 3 is a block diagram of a video image processing apparatus according to an exemplary embodiment.
  • Fig. 4 is a block diagram of an electronic device according to an exemplary embodiment.
  • based on a video, a corresponding 3D image can be generated by performing 3D reconstruction on a target object, such as a human body or a specific object, in each frame of the video.
  • the corresponding three-dimensional dynamic video images can be generated by continuously and rapidly playing the three-dimensional images corresponding to each frame of video images.
  • Fig. 1 is a flowchart of a video image processing method according to an exemplary embodiment. As shown in Fig. 1 , the method includes the following steps.
  • step S11 the first frame of image in the target video is respectively input to the 3D reconstruction network and the video frame coding network to obtain the 3D reconstruction result of the target object in the first frame of image output by the 3D reconstruction network, and The first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object.
  • a pre-built 3D reconstruction network capable of performing accurate 3D reconstruction of the target object in an image may be used, so that the relevant reconstruction data of the first frame image in the target video are passed on for use by subsequent frame images.
  • the three-dimensional reconstruction network may identify the three-dimensional-reconstruction-related feature information of the target object in the first frame image and, based on this feature information, perform three-dimensional reconstruction of the target object in the first frame image to obtain the three-dimensional reconstruction result of the target object in the first frame image.
  • the three-dimensional reconstruction result may be a three-dimensional reconstruction model of the target object, and the three-dimensional-reconstruction-related feature information may be the relevant feature information that needs to be used in the reconstruction. For example, when the target object is a human body image, this feature information may include human body joint point position information and human body region information.
  • the human body joint point position information may include the position of each joint point of the human body in the video frame image, and the human body region information may refer to the position of each pixel of the human body image in the video frame image, or the position of each pixel on the outline of the human body image in the video frame image.
  • the target video can be any video for which a three-dimensional dynamic image needs to be generated; for example, it can be an ordinary single-view color video.
  • the target object can be any object in the target video that needs to be reconstructed in three dimensions, for example a human body image, an image of a specific object, or an image of a building.
  • the first frame image in the target video can also be input into a video frame encoding network for image feature encoding processing, to obtain the first image feature of the first frame image output by the video frame encoding network, where the first image feature is an image feature for the target object.
  • the first image feature may be a high-level image feature obtained by encoding the target object in the image by the video frame encoding network.
  • when the target object is a human body image, the first image feature may be encoded feature information such as human body shape and human body posture, so that the human body shape and posture information in the corresponding image can be determined through the first image feature.
  • the 3D reconstruction result of the target object in the first frame image output by the 3D reconstruction network and the first image feature of the first frame image output by the video frame encoding network can then be used jointly to extract the time series feature of the first frame image.
  • step S12 the first image feature of the i-th frame image in the target video and the 3D reconstruction result of the target object in the i-th frame image are input into a time series feature extraction network to obtain the time series feature of the i-th frame image, where the initial value of i is 1.
  • step S13 the i+1 th frame image in the target video is input to the video frame coding network to obtain the first image feature of the i+1 th frame image.
  • step S14 a three-dimensional reconstruction result of the target object in the i+1 th frame image is generated based on the first image feature of the i+1 th frame image and the time series feature of the i th frame image.
  • the time series feature can be extracted through the time series feature extraction network: after the first image feature of each frame image and the three-dimensional reconstruction result of the target object in that frame image are obtained, the two are input into the time series feature extraction network, which extracts the time series feature of that frame image.
  • the target object can be encoded by the video frame encoding network to obtain the first image feature.
  • for the first frame image in the target video, the three-dimensional reconstruction result of the target object and the first image feature, output respectively by the 3D reconstruction network and the video frame encoding network, are directly input into the time series feature extraction network to obtain the time series feature of the first frame image output by the time series feature extraction network.
  • the first image feature of the first frame image and the three-dimensional reconstruction result of the target object in the first frame image, such as a three-dimensional reconstruction model, may be transformed through the time series feature extraction network to obtain the time series feature of the first frame image; extracting the time series feature is equivalent to multiplexing the first image feature of the first frame image and the three-dimensional reconstruction result of the target object in the first frame image.
  • the time series feature of the first frame image may include the first image feature and the three-dimensional reconstruction feature of the target object.
  • the second frame image in the target video can also be input into the video frame encoding network to obtain the first image feature of the second frame image, and the time series feature of the first frame image can be passed to the second frame image for three-dimensional reconstruction.
  • the time series feature of the first frame of image and the first image feature of the second frame of image may be integrated to obtain the three-dimensional reconstruction-related feature information of the target object in the two-frame image, and based on The three-dimensional reconstruction-related feature information generates a three-dimensional reconstruction result of the target object in the second frame image.
  • the value of i can then be increased by 1, that is, updated to 2, and processing of the second frame image in the target video can begin: the first image feature of the second frame image and the three-dimensional reconstruction result of the target object in the second frame image are input into the time series feature extraction network to obtain the time series feature of the second frame image; the third frame image is input into the video frame encoding network to obtain the first image feature of the third frame image; and based on the first image feature of the third frame image and the time series feature of the second frame image, the three-dimensional reconstruction result of the target object in the third frame image is generated. Each of these processes is similar to the corresponding processing when i is equal to 1 and, to avoid repetition, is not described again here.
  • the three-dimensional reconstruction result of the target object in each frame of images after the first frame of image in the target video may be generated frame by frame according to the above steps S12 to S15.
  • the step S14 includes:
  • Three-dimensional reconstruction is performed on the target object in the i+1 th frame image based on the fusion feature of the i+1 th frame image, and a 3D reconstruction result of the target object in the i+1 th frame image is obtained.
  • the fusion feature is namely the set of feature information of the target object in the current frame image.
  • feature fusion can be performed by means of splicing or addition.
  • here the (i+1)-th frame image is the current frame image. The first image feature of the current frame image and the time series feature of the previous frame image can be spliced; alternatively, the first image feature within the time series feature of the previous frame image can be replaced by the first image feature of the current frame image, and the replaced time series feature can be used as the set of feature information of the target object in the current frame image, that is, the fusion feature of the current frame image.
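The two fusion options just described, splicing and replacement, can be illustrated with plain arrays. This is a toy sketch: the feature dimensions, and the assumption that the first few entries of the time series feature hold the previous frame's image feature, are for illustration only and are not specified by the disclosure.

```python
import numpy as np

FEAT_DIM = 4  # assumed size of the image-feature part

prev_temporal = np.arange(8, dtype=float)  # time series feature of frame i
curr_image_feat = np.ones(FEAT_DIM)        # first image feature of frame i+1

# Option 1: splicing (concatenation) of the two feature vectors
fused_concat = np.concatenate([curr_image_feat, prev_temporal])

# Option 2: replace the (assumed) image-feature slice of the time series
# feature with the current frame's image feature
fused_replace = prev_temporal.copy()
fused_replace[:FEAT_DIM] = curr_image_feat

print(fused_concat.shape)        # (12,)
print(fused_replace[:FEAT_DIM])  # [1. 1. 1. 1.]
```

Either fused vector then serves as the feature information set of the target object in the current frame, from which the 3D result is regressed.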
  • three-dimensional reconstruction of the target object in the (i+1)-th frame image can be performed based on the fusion feature of the (i+1)-th frame image, to generate a 3D model image of the target object and determine the 3D feature information of the target object in that image. For example, when the target object is a human body image, a 3D human body image can be generated, and the 3D human body joint position information, the surface 3D vertex position information, and so on can be determined.
  • a 3D reconstruction model can be used to perform fast 3D reconstruction of the target object in the (i+1)-th frame image; the 3D reconstruction model can be obtained by training an initial 3D reconstruction model with the overall feature information of the target object in a large number of video frame images as the input training data and the corresponding 3D model of the target object as the output training data, where the initial 3D reconstruction model can be a simple convolutional neural network.
  • since the time series feature of the previous frame image is known, it can be used directly to achieve rapid 3D modeling of the target object; that is, when performing three-dimensional reconstruction of the target object in each frame image, the time series feature of that frame image can be cached for use when reconstructing the target object in the next frame image.
  • the number of structural parameters of the three-dimensional reconstruction network is greater than the number of structural parameters of the video frame encoding network.
  • the 3D reconstruction network may be a large backbone convolutional neural network
  • the video frame encoding network may be a lightweight convolutional neural network.
  • the large backbone convolutional neural network may be a convolutional neural network with more layers and more structural parameters
  • the lightweight neural network may be a convolutional neural network with fewer layers and fewer structural parameters
  • the 3D reconstruction network may be obtained by training an initial 3D reconstruction network using a training image set marked with 3D reconstruction data of a first object, where the first object may be a specific object of the same type as the target object; for example, the first object and the target object are both human body images.
  • the training process of the 3D reconstruction network includes:
  • the model parameters of the initial 3D reconstruction network are adjusted to obtain the trained 3D reconstruction network.
  • specifically, a large backbone convolutional neural network is used as the initial 3D reconstruction network, and a large number of video frame images marked with the three-dimensional reconstruction data of the first object are used as the training image set: each training image in the training image set can be input into the initial 3D reconstruction network as input data, the three-dimensional reconstruction data of the first object in each training image output by the initial 3D reconstruction network can be used as output data, and the marked three-dimensional reconstruction data of the first object in each training image can be used as the output training data. By calculating the error between the three-dimensional reconstruction data output for each training image and the marked three-dimensional reconstruction data of that image, the model parameters of the initial 3D reconstruction network are adjusted; through repeated training iterations toward the training objective, the model parameters of the initial 3D reconstruction network are determined, and the trained 3D reconstruction network is obtained.
  • the training objective may be to minimize the error between the model output data and the labeled data, or to make the error less than a certain threshold.
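The training objective just stated, reducing the error between model output and labeled data until it falls below a threshold, can be sketched with a toy model. A linear map stands in for the 3D reconstruction network here; the data, learning rate, and threshold are all illustrative assumptions, not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))          # toy "training images" (feature vectors)
true_W = np.array([1.0, -2.0, 0.5])
y = X @ true_W                        # toy "marked 3D reconstruction data"

W = np.zeros(3)                       # initial model parameters
lr, threshold = 0.1, 1e-4
for step in range(1000):
    pred = X @ W                      # model output for the training set
    err = pred - y
    loss = float(np.mean(err ** 2))   # error between output and labels
    if loss < threshold:              # training goal: error below threshold
        break
    # gradient step on the mean-squared error, adjusting the parameters
    W -= lr * (2 / len(X)) * X.T @ err

print(loss < threshold)  # True once training has converged
```

The same loop structure applies whether the stopping rule is "minimize the error" (run to a fixed iteration budget) or "error below a threshold" (break early, as here).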
  • the three-dimensional reconstruction network obtained by training can effectively and accurately perform three-dimensional reconstruction of the target object in the target video.
  • because the large backbone neural network has many parameters and a large computational capacity, it can be ensured that the 3D reconstruction network obtained through training accurately identifies the 3D reconstruction data of the target object in a video frame image.
  • the video frame encoding network and the time series feature extraction network may be obtained by jointly training a lightweight convolutional neural network using a training video set marked with 3D reconstruction data of a second object, and the second object is also It may be a specific object of the same type as the target object, that is, the first object, the second object and the target object may all be objects of the same type, for example, all are human body images.
  • the training process of the video frame encoding network and the time series feature extraction network includes:
  • a 3D reconstruction result of the second object in the kth frame of training image is generated, where k is any integer between 2 and M;
  • the model parameters of the initial video frame encoding network and the model parameters of the initial time series feature extraction network are adjusted to obtain the trained video frame encoding network and the trained time series feature extraction network.
  • specifically, the 3D reconstruction network can be trained in the aforementioned manner and its model parameters fixed; then the video frame encoding network and the time series feature extraction network can be jointly trained using the training video set.
  • specifically, a lightweight convolutional neural network can be used as the initial video frame encoding network, and another lightweight convolutional neural network can be used as the initial time series feature extraction network. A large amount of video data marked with the three-dimensional reconstruction data of the second object can be used as the training video set, with each training video serving as input data: each frame of training image in the training video is input into the initial video frame encoding network frame by frame, the time series feature of the second object in each frame of training image output by the initial time series feature extraction network is taken as intermediate output, and the marked three-dimensional reconstruction data of the second object in each frame of training image in each training video is used as the output training data to jointly train the initial video frame encoding network and the initial time series feature extraction network. By calculating the error between the three-dimensional reconstruction data generated for each frame of training image and the marked three-dimensional reconstruction data of that frame, the model parameters of the initial video frame encoding network and the initial time series feature extraction network are adjusted; through repeated training iterations toward the training objective, the model parameters of both networks are determined, and the trained video frame encoding network and time series feature extraction network are obtained. The training objective may be to minimize the error between the model output data and the labeled data, or to keep the error smaller than a preset threshold.
  • specifically, the first frame of training image in each training video in the training video set may be input into the trained 3D reconstruction network to obtain the three-dimensional reconstruction result of the second object in the first frame of training image. Then, the three-dimensional reconstruction result of the second object in the first frame of training image in the training video and the second image feature of the first frame of training image are input into the initial time series feature extraction network to obtain the time series feature of the first frame of training image; the next frame of training image in the training video, that is, the second frame of training image, can also be input into the initial video frame encoding network to obtain the second image feature of the second frame of training image, where the second image feature is the image feature for the second object. Thus, based on the time series feature of the first frame of training image in the training video and the second image feature of the second frame of training image, the three-dimensional reconstruction result of the second object in the second frame of training image can be generated.
  • next, the three-dimensional reconstruction result of the second object in the second frame of training image in the training video and the second image feature of the second frame of training image may be input into the initial time series feature extraction network to obtain the time series feature of the second frame of training image; the next frame of training image in the training video, that is, the third frame of training image, can then be input into the initial video frame encoding network to obtain the second image feature of the third frame of training image, so that the three-dimensional reconstruction result of the second object in the third frame of training image can be generated based on the time series feature of the second frame of training image in the training video and the second image feature of the third frame of training image.
  • for each subsequent frame, the time series feature of the current frame of training image can be determined in a similar manner and a similar operation performed on the next frame, so as to determine the three-dimensional reconstruction result of the second object in each frame of training image, and the time series feature of each frame of training image output by the initial time series feature extraction network can be recorded.
  • each process is similar to the related processing methods introduced above, and to avoid repetition, details are not described here.
  • model parameters of the initial video frame encoding network and model parameters of the initial time series feature extraction network are jointly adjusted until the trained video frame encoding network and the trained time series feature extraction network are obtained.
  • the video frame encoding network and the time series feature extraction network obtained by training can effectively and accurately perform fast encoding and time series feature extraction processing on each frame image in the target video.
  • because the lightweight neural network has a small number of parameters and a fast operation speed, it can be ensured that the video frame encoding network obtained through training can quickly identify the first image feature of the target object in a video frame image, thereby meeting the requirements of real-time, low-latency running.
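The joint training scheme described above can be sketched as follows: the heavy 3D reconstruction network is trained first and then frozen, after which the lightweight encoder and the temporal network run the frame-by-frame forward pass together and would both be adjusted from the per-frame reconstruction error. All networks here are toy linear maps, and all names, dimensions, and the addition-based fusion are illustrative assumptions (the gradient step itself is omitted).

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
frozen_recon_W = rng.normal(size=(D, D))  # trained, fixed 3D reconstruction net

enc_W = rng.normal(size=(D, D)) * 0.1     # initial video frame encoding net
tmp_W = rng.normal(size=(D, D)) * 0.1     # initial time series extraction net

def run_video(frames):
    """Frame-by-frame forward pass used during joint training."""
    recon = frames[0] @ frozen_recon_W    # frame 1: frozen heavy network
    feat = frames[0] @ enc_W              # second image feature of frame 1
    outs = [recon]
    for f in frames[1:]:
        temporal = (feat + recon) @ tmp_W  # time series feature of prev frame
        feat = f @ enc_W                   # second image feature of this frame
        recon = feat + temporal            # toy regression to 3D data
        outs.append(recon)
    return outs

frames = [rng.normal(size=D) for _ in range(3)]
labels = [f @ frozen_recon_W for f in frames]  # toy marked 3D data per frame

outs = run_video(frames)
# joint loss over the frames after the first; in training, both enc_W and
# tmp_W would be updated together to reduce it, with frozen_recon_W fixed
loss = sum(float(np.mean((o - l) ** 2)) for o, l in zip(outs[1:], labels[1:]))
print(loss >= 0.0)
```

The key property mirrored here is that gradients from the per-frame error flow only into the two lightweight networks, never into the frozen reconstruction network.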
  • the second object in the kth frame of training image is generated based on the time series feature of the k-1th frame of training image in the training video and the second image feature of the kth frame of training image 3D reconstruction results, including:
  • the time sequence feature of the k-1th frame training image in the training video and the second image feature of the kth frame training image are fused to obtain the fusion feature of the kth frame training image;
  • adjusting the model parameters of the initial video frame encoding network and the model parameters of the initial time series feature extraction network according to the second error includes:
  • the model parameters of the initial video frame encoding network and the model parameters of the initial time series feature extraction network are adjusted with minimizing the second error as the training goal.
  • the error between the model output data and the labeled data can be minimized as the training target in the model training process.
  • the training target can be achieved by constructing a related loss function and determining, through training, the model parameters of each network that minimize the error.
  • the three-dimensional reconstruction data includes human body region positions and human body joint point positions
  • the second image features include body posture features
  • the second error includes a human body joint projection error
  • the embodiment of the present disclosure can be applied to a scene of performing three-dimensional reconstruction of a human body image in a video; that is, the target object in the embodiment of the present disclosure can be a human body image, and the objects used in training are correspondingly also human body images.
  • the second image features may include shape and posture features, that is, human body image features such as human body shape features and human body posture features.
  • the first image features in this embodiment of the present disclosure also include shape and posture features.
  • the three-dimensional reconstruction data can include three-dimensional-reconstruction-related data such as the positions of human body regions and the positions of human body joint points.
  • the second error may include a human body joint projection error, and the first error may correspondingly also include a human body joint projection error. That is to say, in the training of the three-dimensional reconstruction network, the video frame encoding network, and the time series feature extraction network, minimizing the joint projection error can be taken as the training goal: during training, the difference between the positions of the three-dimensional human body joints output by the network being trained and the human body joint positions marked in the training images, that is, the joint projection error, keeps getting smaller until it stabilizes at a very small value, thereby ensuring that the trained networks have relatively high precision.
  • the video image processing method in the embodiment of the present disclosure can be used to perform three-dimensional reconstruction of the human body in the human body video image, and generate a corresponding three-dimensional dynamic image of the human body.
  • the three-dimensional reconstruction data further includes three-dimensional body data
  • the second error further includes vertex errors of the three-dimensional surface of the human body.
  • the second error may also include the vertex error of the three-dimensional surface of the human body, and the training target may correspondingly further include minimizing this vertex error; that is, during training, the difference between the surface 3D vertex positions in the 3D human body reconstruction result output by the network being trained and the manually marked 3D vertex positions of the human body surface, that is, the 3D surface vertex error, keeps getting smaller until it stabilizes at a small value.
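A toy version of this combined training error, a joint projection error plus a 3D surface vertex error, is sketched below. The joint count, vertex count, the mean-distance formulation, and the implicit equal weighting of the two terms are all assumptions for illustration; the disclosure does not fix these specifics.

```python
import numpy as np

def joint_projection_error(pred_joints_2d, gt_joints_2d):
    # mean distance between projected joints and the marked joint positions
    return float(np.mean(np.linalg.norm(pred_joints_2d - gt_joints_2d, axis=-1)))

def surface_vertex_error(pred_verts_3d, gt_verts_3d):
    # mean distance between predicted and marked 3D surface vertices
    return float(np.mean(np.linalg.norm(pred_verts_3d - gt_verts_3d, axis=-1)))

rng = np.random.default_rng(2)
gt_j = rng.normal(size=(17, 2))   # e.g. 17 marked 2D joint positions
gt_v = rng.normal(size=(100, 3))  # e.g. 100 marked surface vertices

pred_j = gt_j + 0.01              # a nearly-correct prediction
pred_v = gt_v + 0.01

total = joint_projection_error(pred_j, gt_j) + surface_vertex_error(pred_v, gt_v)
print(total < 0.1)  # small combined error for a near-correct prediction
```

During training, this combined error would be the quantity driven toward zero (or below a threshold) as the network parameters are adjusted.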
  • the specific implementation of the video image processing method in the embodiment of the present disclosure will be illustrated by taking the target object as a human body image as an example:
  • the first frame image in the video can be input into the 3D human body reconstruction network, and the 3D reconstruction result of the human body in the first frame image can be obtained.
  • the 3D human body reconstruction network can be a large-scale backbone convolutional neural network, which has a large number of parameters and a large amount of calculation, and can be trained with massive single-frame annotated human image data to perform precise 3D reconstruction of the human body in color images.
  • the first frame image can also be input into the video frame encoding network to obtain the high-level image features corresponding to the first frame image.
  • the video frame encoding network can be a lightweight convolutional neural network, which has a small number of parameters and a fast operation speed, and can therefore meet the requirements of real-time, low-latency operation. The high-level image features can be intermediate features output by some layers of the convolutional neural network, which encode feature information such as human body shape and posture.
  • the high-level image features of the first frame image output by the video frame encoding network and the 3D human body reconstruction results output by the 3D human body reconstruction network can be jointly input to the time series feature extraction network.
  • the function of the time series feature extraction network is to synthesize the high-level image features and the 3D human body reconstruction result of the current frame, extract the time series features of the current frame, and pass them to the reconstruction process of the 3D human body model in subsequent frames.
  • the second frame image in the video is input to the video frame encoding network to obtain the corresponding high-level image features. These high-level image features are fused with the time series features of the previous frame image passed in the previous step, and the fused features are regressed through a simple convolutional neural network to obtain the 3D human body reconstruction result of the second frame image. Then, the above-mentioned process of synthesizing the high-level image features and the three-dimensional human body reconstruction result of the current frame to extract the time series features of the current frame may be repeated.
  • a method similar to the 3D human body reconstruction process of the second frame image can be used to obtain the 3D human body reconstruction result of each subsequent frame image: the time series features passed from the previous frame and the high-level image features of the current frame extracted by the video frame encoding network are used to reconstruct the 3D human body model of the current frame, and the time series features of the current frame are then generated.
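The per-frame flow described in this example can be summarized in a short sketch. The four callables stand in for the heavy 3D human body reconstruction backbone, the lightweight video frame encoding network, the time series feature extraction network, and the small fusion-and-regression network; their names and signatures are illustrative assumptions, not part of the disclosure.

```python
def reconstruct_human_video(frames, backbone, encoder, extract_temporal, fuse_and_regress):
    """Runs the expensive backbone only on the first frame; every later
    frame uses the lightweight encoder plus the time series feature
    handed forward from the previous frame."""
    results = []
    reconstruction = backbone(frames[0])   # precise single-frame 3D reconstruction
    feature = encoder(frames[0])           # high-level image features of frame 1
    results.append(reconstruction)
    temporal = extract_temporal(feature, reconstruction)
    for frame in frames[1:]:
        feature = encoder(frame)                             # fast, low latency
        reconstruction = fuse_and_regress(feature, temporal) # fuse + regress
        results.append(reconstruction)
        temporal = extract_temporal(feature, reconstruction) # pass to next frame
    return results
```

With toy stand-ins for the four networks, the function returns one reconstruction per frame, the first produced by the backbone and the rest by the lightweight path.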
  • the first frame image in the target video is respectively input to a 3D reconstruction network and a video frame encoding network, and the 3D reconstruction result of the target object in the first frame image output by the 3D reconstruction network, as well as the first image feature of the first frame image output by the video frame encoding network, are obtained.
  • the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image are input to the time series feature extraction network to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1; the (i+1)-th frame image in the target video is input into the video frame encoding network to obtain the first image feature of the (i+1)-th frame image; a three-dimensional reconstruction result of the target object in the (i+1)-th frame image is generated based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image; and the value of i is updated to i+1, with these steps repeated until i = N.
  • the solution has the advantages of small calculation amount, high processing speed and high efficiency.
  • Fig. 3 is a block diagram of a video image processing apparatus according to an exemplary embodiment.
  • the video image processing apparatus includes a first processing module 301 , a second processing module 302 , a third processing module 303 , a three-dimensional reconstruction module 304 and an execution module 305 .
  • the first processing module 301 is configured to input the first frame image in the target video to the 3D reconstruction network and the video frame encoding network respectively, and obtain the three-dimensional reconstruction result of the target object in the first frame image output by the 3D reconstruction network, and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object;
  • the second processing module 302 is configured to input the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1 and i is a positive integer;
  • the third processing module 303 is configured to input the i+1 th frame image in the target video into the video frame coding network to obtain the first image feature of the i+1 th frame image;
  • the 3D reconstruction module 304 is configured to generate a three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image;
  • the number of structural parameters of the three-dimensional reconstruction network is greater than the number of structural parameters of the video frame encoding network.
  • the three-dimensional reconstruction module 304 includes:
  • a fusion unit configured to perform fusion of the first image feature of the i+1 th frame image and the time sequence feature of the i th frame image to obtain the fusion feature of the i+1 th frame image;
  • a three-dimensional reconstruction unit configured to perform three-dimensional reconstruction of the target object in the i+1th frame image based on the fusion feature of the i+1th frame image, to obtain the target object in the i+1th frame image 3D reconstruction results.
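The disclosure does not pin down a specific fusion operator for the fusion unit. One minimal assumption, shown purely for illustration, is concatenating the two feature vectors and regressing with a linear map (a toy stand-in for the simple convolutional regression network):

```python
def fuse(image_feature, temporal_feature):
    # Fusion by concatenation: the first image feature of frame i+1
    # followed by the time series feature of frame i (one possible scheme).
    return list(image_feature) + list(temporal_feature)

def regress(fused_feature, weights, bias=0.0):
    # Toy linear stand-in for the regression network that maps the
    # fused feature to a 3D reconstruction parameter.
    assert len(fused_feature) == len(weights)
    return bias + sum(f * w for f, w in zip(fused_feature, weights))
```

In practice the fusion could equally be addition or gating, and the regressor a small convolutional network; this sketch only fixes the data flow of the fusion unit and three-dimensional reconstruction unit.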
  • the training process of the 3D reconstruction network includes:
  • the model parameters of the initial 3D reconstruction network are adjusted to obtain the trained 3D reconstruction network.
  • the training process of the video frame encoding network and the time series feature extraction network includes:
  • a 3D reconstruction result of the second object in the kth frame of training image is generated, where k is any integer between 2 and M;
  • the model parameters of the initial video frame encoding network and the model parameters of the initial time series feature extraction network are adjusted to obtain the trained video frame encoding network and the trained time series feature extraction network.
  • the generating, based on the time series feature of the (k-1)-th frame of training image in the training video and the second image feature of the k-th frame of training image, of the three-dimensional reconstruction result of the second object in the k-th frame of training image includes:
  • the time sequence feature of the k-1th frame training image in the training video and the second image feature of the kth frame training image are fused to obtain the fusion feature of the kth frame training image;
  • based on the fusion feature of the k-th frame of training image, three-dimensional reconstruction is performed on the second object in the k-th frame of training image, and a three-dimensional reconstruction result of the second object in the k-th frame of training image is obtained.
  • the three-dimensional reconstruction data includes human body region positions and human body joint point positions
  • the second image features include body posture features
  • the second error includes human body joints projection error
  • the three-dimensional reconstruction data further includes three-dimensional body data
  • the second error further includes vertex errors of the three-dimensional surface of the human body.
  • the video image processing apparatus 300 in this embodiment of the present disclosure inputs the first frame image in the target video to a three-dimensional reconstruction network and a video frame encoding network respectively, and obtains the three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is the image feature for the target object; inputs the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image to the time series feature extraction network to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1; inputs the (i+1)-th frame image in the target video into the video frame encoding network to obtain the first image feature of the (i+1)-th frame image; generates the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image; and updates the value of i to i+1 and repeats the above steps until i = N, where N is the total number of frames of the target video.
  • FIG. 4 is a block diagram of an electronic device 400 according to an exemplary embodiment.
  • a computer-readable storage medium including instructions, such as the memory 410 including instructions, is also provided, and the instructions can be executed by the processor 420 of the electronic device 400 to complete the above video image processing method.
  • the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • the bus architecture may include any number of interconnected buses and bridges, linking together one or more processors represented by the processor 420 and various memory circuits represented by the memory 410.
  • the bus architecture may also link together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and, therefore, will not be described further herein.
  • the bus interface 430 provides the interface.
  • the processor 420 is responsible for managing the bus architecture and general processing, and the memory 410 may store data used by the processor 420 in performing operations.
  • a computer program product including a computer program, which implements the above-mentioned video image processing method when executed by a processor.


Abstract

The present disclosure relates to a video image processing method and apparatus, an electronic device, and a storage medium. The method comprises: respectively inputting a first frame of image in a target video into a three-dimensional reconstruction network and a video frame encoding network, to obtain a three-dimensional reconstruction result of a target object in the first frame of image and an image feature of the first frame of image; inputting the image feature of the i-th frame of image in the target video and the three-dimensional reconstruction result corresponding to the i-th frame of image into a time sequence feature extraction network, to obtain a time sequence feature of the i-th frame of image; inputting the (i+1)-th frame of image in the target video into the video frame encoding network, to obtain an image feature of the (i+1)-th frame of image; generating, on the basis of the image feature of the (i+1)-th frame of image and the time sequence feature of the i-th frame of image, a three-dimensional reconstruction result corresponding to the (i+1)-th frame of image; and updating the value of i to i+1, and repeatedly executing the steps from the inputting into the time sequence feature extraction network to the generating of the three-dimensional reconstruction result corresponding to the (i+1)-th frame of image, until i = N.

Description

Video image processing method and apparatus
CROSS-REFERENCE TO RELATED APPLICATIONS
This disclosure is based on, and claims priority to, the Chinese patent application with application number 202011625995.2 filed on December 31, 2020, the disclosure of which is hereby incorporated by reference in its entirety as a part of this disclosure.
TECHNICAL FIELD
The present disclosure relates to the technical field of computer processing, and in particular, to a video image processing method and apparatus, an electronic device, and a storage medium.
BACKGROUND
Human body pose estimation and three-dimensional human body model reconstruction in video images aim to recover the positions of the human body joints and the three-dimensional model of the human body surface in each video frame. This technology is widely used in scenarios such as security, health monitoring, computer animation, virtual reality, and augmented reality.
In the related art, optical flow or a recurrent neural network (RNN) is usually used to extract temporal information for dynamic three-dimensional human body model reconstruction. This solution first extracts the optical flow information in the input video images, then uses a deep RNN or a temporal convolutional network to extract the human body features in the input video images, and finally uses the extracted human body features to regress the human body pose or three-dimensional model.
SUMMARY
The present disclosure provides a video image processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product. The technical solutions of the present disclosure are as follows:
According to a first aspect of the embodiments of the present disclosure, a video image processing method is provided, including:
inputting a first frame image in a target video to a three-dimensional reconstruction network and a video frame encoding network respectively, to obtain a three-dimensional reconstruction result of a target object in the first frame image output by the three-dimensional reconstruction network, and a first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object;
inputting the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image to a time series feature extraction network, to obtain a time series feature of the i-th frame image, wherein the initial value of i is 1;
inputting the (i+1)-th frame image in the target video to the video frame encoding network, to obtain a first image feature of the (i+1)-th frame image;
generating a three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image; and
updating the value of i to i+1, and repeatedly executing the above steps, from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image to the time series feature extraction network, to generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image, until i = N, where N is the total number of frames of the target video.
According to a second aspect of the embodiments of the present disclosure, a video image processing apparatus is provided, including:
a first processing module, configured to input the first frame image in the target video to the three-dimensional reconstruction network and the video frame encoding network respectively, and obtain the three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object;
a second processing module, configured to input the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image to the time series feature extraction network, to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1 and i is a positive integer;
a third processing module, configured to input the (i+1)-th frame image in the target video to the video frame encoding network, to obtain the first image feature of the (i+1)-th frame image;
a three-dimensional reconstruction module, configured to generate the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image; and
an execution module, configured to update the value of i to i+1 and repeatedly execute the above steps, from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image to the time series feature extraction network, to generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image, until i = N, where N is the total number of frames of the target video.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the video image processing method described in the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. When the instructions in the computer-readable storage medium are executed by an electronic device, the electronic device is enabled to execute the video image processing method described in the first aspect.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the video image processing method described in the first aspect.
The first frame image in the target video is respectively input to a three-dimensional reconstruction network and a video frame encoding network, to obtain the three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object; the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image are input to the time series feature extraction network, to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1; the (i+1)-th frame image in the target video is input to the video frame encoding network, to obtain the first image feature of the (i+1)-th frame image; the three-dimensional reconstruction result of the target object in the (i+1)-th frame image is generated based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image; and the value of i is updated to i+1, with the above steps, from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image to the time series feature extraction network, to generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image, repeatedly executed until i = N, where N is the total number of frames of the target video.
In this way, by using the three-dimensional reconstruction network to perform three-dimensional reconstruction on the target object in the first frame image of the video, a relatively accurate three-dimensional reconstruction result is obtained; and for each subsequent frame image in the video, by combining the three-dimensional reconstruction result of the target object in the first frame image and the first image feature of each frame image, the target object in each frame image can be quickly and accurately reconstructed in three dimensions. Compared with the solutions in the related art, this solution has the advantages of a small amount of calculation, a high processing speed, and high efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a flowchart of a video image processing method according to an exemplary embodiment.
Fig. 2 is a flowchart of three-dimensional reconstruction of a human body in a video image according to an exemplary embodiment.
Fig. 3 is a block diagram of a video image processing apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram of an electronic device according to an exemplary embodiment.
DETAILED DESCRIPTION
In order to enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first", "second", and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used may be interchanged under appropriate circumstances so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.
The embodiments of the present disclosure can be applied to scenarios such as three-dimensional animation production and augmented reality. Specifically, based on a video, three-dimensional reconstruction can be performed on a target object, such as a human body or a specific object, in each video frame image to generate a corresponding three-dimensional image; finally, the three-dimensional images corresponding to the video frame images are played continuously and rapidly to generate a corresponding three-dimensional dynamic video image.
Fig. 1 is a flowchart of a video image processing method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps.
In step S11, the first frame image in the target video is respectively input to the three-dimensional reconstruction network and the video frame encoding network, to obtain the three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object.
In the embodiment of the present disclosure, in order to ensure that a relatively accurate three-dimensional reconstruction result of the target object in the target video can be obtained, a pre-built three-dimensional reconstruction network capable of performing accurate three-dimensional reconstruction of the target object in an image may be used to perform three-dimensional reconstruction on the first frame image of the target video, to obtain the three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and the relevant reconstruction data in the three-dimensional reconstruction result may be passed on for use with subsequent frame images.
In some embodiments, the three-dimensional reconstruction network may identify three-dimensional reconstruction related feature information of the target object in the first frame image, and perform three-dimensional reconstruction on the target object in the first frame image based on this feature information, to obtain the three-dimensional reconstruction result of the target object in the first frame image. The three-dimensional reconstruction result may be a three-dimensional reconstruction model of the target object, and the three-dimensional reconstruction related feature information may be the relevant feature information needed for three-dimensional reconstruction. For example, when the target object is a human body image, the three-dimensional reconstruction related feature information may include feature information such as human body joint point position information and human body region information, where the human body joint point position information may include the position of each joint point of the human body in the video frame image, and the human body region information may refer to the position of each pixel of the human body image in the video frame image, or the position of each pixel on the contour of the human body image in the video frame image.
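The reconstruction-related feature information enumerated above for the human-body case can be grouped into a simple record. The field names below are illustrative only and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x, y) position of a point in the video frame image.
Pixel = Tuple[int, int]

@dataclass
class BodyReconstructionFeatures:
    # Position of each human body joint point in the video frame image.
    joint_points: List[Pixel] = field(default_factory=list)
    # Positions of the pixels on the contour of the human body image
    # (one way of representing the human body region information).
    region_contour: List[Pixel] = field(default_factory=list)
```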
其中,所述目标视频可以是任意需要生成三维动态图像的视频,且所述目标视频可以是普通的单视角彩***,所述目标对象可以是所述目标视频中任意需要进行三维重建的对象,例如,人体图像、特定物体图像或建筑物图像等。Wherein, the target video can be any video that needs to generate a three-dimensional dynamic image, and the target video can be an ordinary single-view color video, and the target object can be any object in the target video that needs to be reconstructed in three dimensions, For example, an image of a human body, an image of a specific object, or an image of a building, etc.
对于所述目标视频中的第一帧图像,还可以将其输入视频帧编码网络进行图像特征编码处理,进而得到由所述视频帧编码网络输出的所述第一帧图像的第一图像特征,所述第一图像特征为针对所述目标对象的图像特征,在一些实施例中,所述第一图像特征可以是所述视频帧编码网络对图像中的目标对象进行编码得到的高层级图像特征,例如,目标对象为人体图像时,第一图像特征可以是编码的人体形体、人体姿态等特征信息,从而通过所述第一图像特征,可以确定对应图像中的人体形体和人体姿态信息。For the first frame image in the target video, it can also be input into a video frame encoding network for image feature encoding processing, and then obtain the first image feature of the first frame image output by the video frame encoding network, The first image feature is an image feature for the target object. In some embodiments, the first image feature may be a high-level image feature obtained by encoding the target object in the image by the video frame encoding network. For example, when the target object is a human body image, the first image feature may be encoded feature information such as human body shape and human body posture, so that the human body shape and human body posture information in the corresponding image can be determined through the first image feature.
该步骤中,通过所述三维重建网络输出的所述第一帧图像中的目标对象的三维重建结果,以及通过所述视频帧编码网络输出的所述第一帧图像的第一图像特征,可以用于结合提取所述第一帧图像中的时序特征。In this step, the 3D reconstruction result of the target object in the first frame image output by the 3D reconstruction network, and the first image feature of the first frame image output by the video frame encoding network can be Used to combine and extract the time series features in the first frame image.
In step S12, the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image are input into a temporal feature extraction network to obtain the temporal feature of the i-th frame image, where the initial value of i is 1.
In step S13, the (i+1)-th frame image in the target video is input into the video frame encoding network to obtain the first image feature of the (i+1)-th frame image.
In step S14, a three-dimensional reconstruction result of the target object in the (i+1)-th frame image is generated based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image.
In step S15, the value of i is updated to i+1, and steps S12 to S14 above are repeated until i=N, where N is the total number of frames of the target video.
In the embodiments of the present disclosure, steps S12 to S15 above may be repeated for successive values of i, with the loop starting from i=1 and ending when i=N.
For each frame image in the target video, its temporal feature may be extracted by the temporal feature extraction network. In some embodiments, after the first image feature of a frame image and the three-dimensional reconstruction result of the target object in that frame image have been obtained, both may be input into the temporal feature extraction network, which extracts the temporal feature of that frame image.
For each frame image in the target video other than the first frame image, the target object therein may be encoded by the video frame encoding network to obtain its first image feature.
In some embodiments, since the initial value of i is 1, processing may start from the 1st frame image in the target video, that is, the first frame image: the three-dimensional reconstruction result of the target object in the 1st frame image output by the three-dimensional reconstruction network and the first image feature of the 1st frame image output by the video frame encoding network are directly input into the temporal feature extraction network to obtain the temporal feature of the 1st frame image output by that network. In some embodiments, the first image feature of the 1st frame image and the three-dimensional reconstruction result of the target object in the 1st frame image, such as a three-dimensional reconstruction model, may be transformed by the temporal feature extraction network to obtain the temporal feature of the 1st frame image. Extracting the temporal feature is thus equivalent to reusing the first image feature of the 1st frame image and the three-dimensional reconstruction result of the target object in the 1st frame image; in other words, the temporal feature of the 1st frame image may include the first image feature and the three-dimensional reconstruction feature of the target object.
The 2nd frame image in the target video may also be input into the video frame encoding network to obtain the first image feature of the 2nd frame image, and the temporal feature of the target object in the 1st frame image may be passed on for use in the three-dimensional reconstruction of the 2nd frame image. In some embodiments, the temporal feature of the 1st frame image and the first image feature of the 2nd frame image may be combined to obtain the three-dimensional-reconstruction-related feature information of the target object in the 2nd frame image, and the three-dimensional reconstruction result of the target object in the 2nd frame image is generated based on that feature information.
Then, after the three-dimensional reconstruction result of the target object in the 2nd frame image is generated, the value of i may be incremented by 1, that is, updated to 2: the first image feature of the 2nd frame image in the target video and the three-dimensional reconstruction result of the target object in the 2nd frame image are input into the temporal feature extraction network to obtain the temporal feature of the 2nd frame image; the 3rd frame image in the target video is input into the video frame encoding network to obtain the first image feature of the 3rd frame image; and the three-dimensional reconstruction result of the target object in the 3rd frame image is generated based on the first image feature of the 3rd frame image and the temporal feature of the 2nd frame image. Each of these processes is similar to the corresponding processing when i equals 1, and is not repeated here to avoid redundancy.
In this way, each time the three-dimensional reconstruction result of one frame image is generated, the value of i is incremented by 1, and the three-dimensional reconstruction result of the target object in each subsequent frame image in the target video is generated following the similar process described above.
That is to say, in the embodiments of the present disclosure, the three-dimensional reconstruction result of the target object in each frame image after the 1st frame image in the target video may be generated frame by frame according to steps S12 to S15 above.
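The frame-by-frame procedure of steps S12 to S15 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the callables `reconstruct_3d`, `encode_frame`, `extract_temporal_feature`, and `regress_from_fusion` are hypothetical placeholders standing in for the three-dimensional reconstruction network, the video frame encoding network, the temporal feature extraction network, and the fusion-based regression, respectively.

```python
def process_video(frames, reconstruct_3d, encode_frame,
                  extract_temporal_feature, regress_from_fusion):
    """Frame-by-frame 3D reconstruction loop (steps S12 to S15).

    Only the first frame goes through the heavy 3D reconstruction
    network; every later frame reuses the cached temporal feature
    derived from the previous frame.
    """
    results = []
    # First frame: heavy 3D reconstruction + lightweight encoding.
    recon = reconstruct_3d(frames[0])
    feat = encode_frame(frames[0])
    results.append(recon)
    for i in range(1, len(frames)):
        # Step S12: temporal feature of frame i (cached for reuse).
        temporal = extract_temporal_feature(feat, recon)
        # Step S13: lightweight encoding of frame i+1.
        feat = encode_frame(frames[i])
        # Step S14: fuse features and regress the 3D result of frame i+1.
        recon = regress_from_fusion(feat, temporal)
        results.append(recon)
    return results
```

With toy stand-in functions, the loop produces one reconstruction per frame while invoking `reconstruct_3d` exactly once, which is the source of the method's speed advantage.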
In some embodiments, step S14 includes:
fusing the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image to obtain the fused feature of the (i+1)-th frame image;
performing three-dimensional reconstruction on the target object in the (i+1)-th frame image based on the fused feature of the (i+1)-th frame image to obtain the three-dimensional reconstruction result of the target object in the (i+1)-th frame image.
That is, after the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image have been obtained, the two kinds of feature information may be fused to obtain the fused feature of the (i+1)-th frame image. The fused feature is the set of feature information about the target object in the current frame image. In some embodiments, feature fusion may be performed by concatenation or summation. For example, with the (i+1)-th frame image as the current frame image, the first image feature of the current frame image may be concatenated with the temporal feature of the previous frame image; alternatively, the first image feature of the current frame image may replace the first image feature within the temporal feature of the previous frame image, and the resulting temporal feature serves as the set of feature information of the target object in the current frame image, that is, the fused feature of the current frame image.
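The two fusion options mentioned above can be sketched as follows, assuming for illustration that both features are plain NumPy vectors and that the temporal feature stores its first-image-feature part in its leading entries; neither assumption is specified by the disclosure.

```python
import numpy as np

def fuse_by_concat(image_feat, prev_temporal):
    # Option 1: concatenate the current frame's first image feature
    # with the previous frame's temporal feature.
    return np.concatenate([image_feat, prev_temporal])

def fuse_by_replace(image_feat, prev_temporal):
    # Option 2: overwrite the first-image-feature slots inside the
    # previous frame's temporal feature with the current feature.
    fused = prev_temporal.copy()
    fused[:image_feat.shape[0]] = image_feat
    return fused
```

Either result can then be fed to the regression model that produces the (i+1)-th frame's three-dimensional reconstruction.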
After the fused feature of the (i+1)-th frame image is obtained, three-dimensional reconstruction may be performed on the target object in the (i+1)-th frame image based on the fused feature, generating a three-dimensional model image of the target object and determining the three-dimensional feature information of the target object in the three-dimensional model image. For example, when the target object is a human body image, a three-dimensional human body image may be generated, and three-dimensional human joint position information, surface three-dimensional vertex position information, and the like may be determined. In some embodiments, a three-dimensional reconstruction model may be used to perform fast three-dimensional reconstruction on the target object in the (i+1)-th frame image. This three-dimensional reconstruction model may be obtained by training an initial three-dimensional reconstruction model using the overall feature information of target objects in a large number of video frame images as input training data and the corresponding three-dimensional models of the target objects as output training data; the initial three-dimensional reconstruction model may be a simple convolutional neural network.
In this way, through feature fusion and simple model regression, a relatively accurate three-dimensional reconstruction result of the target object in the (i+1)-th frame image can be constructed quickly.
It should be noted that, in the embodiments of the present disclosure, when the target object in the current frame image is to be three-dimensionally reconstructed, the temporal feature of the previous frame image can be used directly to achieve fast three-dimensional modeling of the target object; that is, the temporal feature of the previous frame image is already known. When performing three-dimensional reconstruction on the target object in each frame image, the temporal feature of that frame image may be cached for use in the three-dimensional reconstruction of the target object in the next frame image.
In this way, when three-dimensionally reconstructing the target object in the current frame image, only the first image feature needs to be identified; other feature information need not be identified again, as it can be obtained directly from the three-dimensional reconstruction result of the previous frame image. This greatly reduces the amount of computation and increases the speed of three-dimensional reconstruction for each frame image.
In some embodiments, the number of structural parameters of the three-dimensional reconstruction network is greater than the number of structural parameters of the video frame encoding network.
That is, in the embodiments of the present disclosure, to ensure both the speed and the accuracy of the three-dimensional reconstruction of the target object in the target video, the three-dimensional reconstruction network may be a large backbone convolutional neural network, and the video frame encoding network may be a lightweight convolutional neural network. The large backbone convolutional neural network may be a convolutional neural network with more layers and more structural parameters; the lightweight neural network may be a convolutional neural network with fewer layers and fewer structural parameters; and the number of structural parameters of the three-dimensional reconstruction network is greater than that of the video frame encoding network.
The three-dimensional reconstruction network may be obtained by training an initial three-dimensional reconstruction network using a training image set annotated with three-dimensional reconstruction data of a first object, where the first object may be a specific object of the same type as the target object; for example, both the first object and the target object are human body images.
In some embodiments, the training process of the three-dimensional reconstruction network includes:
acquiring a training image set annotated with three-dimensional reconstruction data of a first object, wherein the type of the first object is the same as the type of the target object;
inputting the training images in the training image set into an initial three-dimensional reconstruction network to obtain three-dimensional reconstruction data of each training image;
calculating a first error between the three-dimensional reconstruction data of each training image and the annotated three-dimensional reconstruction data of each training image;
adjusting the model parameters of the initial three-dimensional reconstruction network based on the first error to obtain the trained three-dimensional reconstruction network.
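The four training steps above amount to a generic error-driven parameter adjustment loop, which might be organized as in the sketch below. The `predict` and `update` callables are hypothetical placeholders for the network's forward pass and the optimizer step (neither is specified in the disclosure), and the stopping rule illustrates the "error below a threshold" objective described later.

```python
def train_reconstruction_network(train_images, labels, predict, update,
                                 params, threshold, max_iters=100):
    """Error-driven training loop for the initial 3D reconstruction network.

    predict(params, image) -> predicted 3D reconstruction data
    update(params, error)  -> adjusted model parameters
    Training stops once the mean first error drops below `threshold`
    (or after `max_iters` passes over the training image set).
    """
    for _ in range(max_iters):
        # First error: predicted vs. annotated 3D reconstruction data.
        errors = [abs(predict(params, x) - y)
                  for x, y in zip(train_images, labels)]
        mean_error = sum(errors) / len(errors)
        if mean_error < threshold:
            break
        params = update(params, mean_error)
    return params
```

With a toy one-parameter "network", the loop converges once the first error reaches zero, mirroring the minimize-or-threshold objective described in the text.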
In some embodiments, to ensure accurate three-dimensional reconstruction of the target object in the first frame image of the video to be processed, a large backbone convolutional neural network is adopted as the initial three-dimensional reconstruction network, and a large number of video frame images annotated with three-dimensional reconstruction data of the first object are used as the training image set. Each training image in the training image set is input into the initial three-dimensional reconstruction network as input data, the three-dimensional reconstruction data of the first object in each training image output by the initial three-dimensional reconstruction network serves as output data, and the annotated three-dimensional reconstruction data of the first object in each training image may be used as output training data. By calculating the error between the three-dimensional reconstruction data of each training image and the annotated three-dimensional reconstruction data of each training image, the model parameters of the initial three-dimensional reconstruction network are adjusted during training; through repeated training iterations toward the training objective, the model parameters of the initial three-dimensional reconstruction network are determined, yielding the trained three-dimensional reconstruction network. The training objective may be to minimize the error between the model output data and the annotated data, or to make the error smaller than a certain threshold.
In this way, the above training process ensures that the trained three-dimensional reconstruction network can effectively and accurately perform three-dimensional reconstruction of the target object in the target video. Because the large backbone neural network has a large number of parameters and a large amount of computation, the trained three-dimensional reconstruction network can accurately identify the three-dimensional reconstruction data of the target object in a video frame image and accurately reconstruct that target object in three dimensions. Moreover, since the three-dimensional reconstruction network only needs to process the first frame image of the target video, and the other frame images are processed faster, both reconstruction accuracy and processing speed can be achieved.
The video frame encoding network and the temporal feature extraction network may be obtained by jointly training lightweight convolutional neural networks using a training video set annotated with three-dimensional reconstruction data of a second object. The second object may also be a specific object of the same type as the target object; that is, the first object, the second object, and the target object may all be objects of the same type, for example, all human body images.
In some embodiments, the training process of the video frame encoding network and the temporal feature extraction network includes:
acquiring a training video set annotated with three-dimensional reconstruction data of a second object, wherein the type of the second object is the same as the type of the target object;
inputting the first frame training image of a training video in the training video set into the trained three-dimensional reconstruction network to obtain a three-dimensional reconstruction result of the second object in the first frame training image;
inputting each frame training image of the training video into an initial video frame encoding network to obtain a second image feature of each frame training image, the second image feature being an image feature of the second object;
inputting the three-dimensional reconstruction result of the second object in the j-th frame training image of the training video and the second image feature of the j-th frame training image into an initial temporal feature extraction network to obtain the temporal feature of the j-th frame training image, wherein j is any integer from 1 to M, and M is the total number of frames of the training video;
generating a three-dimensional reconstruction result of the second object in the k-th frame training image based on the temporal feature of the (k-1)-th frame training image in the training video and the second image feature of the k-th frame training image, wherein k is any integer from 2 to M;
calculating a second error between the three-dimensional reconstruction data corresponding to the temporal feature of each frame training image and the annotated three-dimensional reconstruction data of each frame training image;
adjusting the model parameters of the initial video frame encoding network and the model parameters of the initial temporal feature extraction network according to the second error to obtain the trained video frame encoding network and temporal feature extraction network.
That is, in the embodiments of the present disclosure, the three-dimensional reconstruction network may first be trained as described above, and after its model parameters are fixed, the video frame encoding network and the temporal feature extraction network are jointly trained using the training video set.
To ensure fast three-dimensional reconstruction of the target object in each frame image after the first frame image of the video to be processed, a lightweight convolutional neural network may be adopted as the initial video frame encoding network, another lightweight convolutional neural network may be adopted as the initial temporal feature extraction network, and a large amount of video data annotated with three-dimensional reconstruction data of the second object may be used as the training video set. Each training video in the training video set serves as input data: the frame training images of the training video are input frame by frame into the initial video frame encoding network, and the temporal features of the second object in each frame training image output by the temporal feature extraction network serve as output data.
The annotated three-dimensional reconstruction data of the second object in each frame training image of each training video may be used as output training data to jointly train the initial video frame encoding network and the initial temporal feature extraction network. By calculating the error between the three-dimensional reconstruction data of each frame training image in the training video and the annotated three-dimensional reconstruction data of each frame training image, the model parameters of the initial video frame encoding network and the initial temporal feature extraction network are adjusted; through repeated training iterations toward the training objective, the model parameters of the two networks are determined, yielding the trained video frame encoding network and temporal feature extraction network. The training objective may be to minimize the error between the model output data and the annotated data, or to keep the error smaller than a preset threshold.
In some embodiments, in the above training process, the first frame training image of a training video in the training video set, that is, the 1st frame training image, may first be input into the trained three-dimensional reconstruction network to obtain the three-dimensional reconstruction result of the second object in the 1st frame training image. Then, the three-dimensional reconstruction result of the second object in the 1st frame training image and the second image feature of the 1st frame training image may be input into the initial temporal feature extraction network to obtain the temporal feature of the 1st frame training image. The next frame training image of the training video, that is, the 2nd frame training image, may also be input into the initial video frame encoding network to obtain the second image feature of the 2nd frame training image, the second image feature being an image feature of the second object. The three-dimensional reconstruction result of the second object in the 2nd frame training image can then be generated based on the temporal feature of the 1st frame training image and the second image feature of the 2nd frame training image.
Similarly, the three-dimensional reconstruction result of the second object in the 2nd frame training image and the second image feature of the 2nd frame training image may be input into the initial temporal feature extraction network to obtain the temporal feature of the 2nd frame training image, and the next frame training image of the training video, that is, the 3rd frame training image, may be input into the initial video frame encoding network to obtain the second image feature of the 3rd frame training image, so that the three-dimensional reconstruction result of the second object in the 3rd frame training image can be generated based on the temporal feature of the 2nd frame training image and the second image feature of the 3rd frame training image. In this way, each time the three-dimensional reconstruction result of one frame training image is output, the temporal feature of the current frame training image is determined in a similar manner and the same operations continue for the next frame, determining the three-dimensional reconstruction result of the second object in each frame training image; the temporal feature of each frame training image output by the initial temporal feature extraction network may also be recorded. Each of these processes is similar to the related processing introduced above and is not repeated here to avoid redundancy.
Finally, the error between the three-dimensional reconstruction data corresponding to the temporal feature of each frame training image and the annotated three-dimensional reconstruction data of each frame training image is calculated, and based on this error, the model parameters of the initial video frame encoding network and the model parameters of the initial temporal feature extraction network are jointly adjusted until the trained video frame encoding network and temporal feature extraction network are obtained.
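The unrolled forward pass of this joint training stage can be sketched as follows; as before, the callables are hypothetical placeholders. The trained three-dimensional reconstruction network (`frozen_recon`) is applied only to the first frame and is not updated, while `encoder` and `temporal_net` are the modules being trained jointly from the per-frame second error.

```python
def joint_training_forward(video, frozen_recon, encoder, temporal_net,
                           regressor):
    """Unrolled forward pass over one training video.

    Returns one predicted 3D reconstruction per frame; the per-frame
    predictions are what the second error is computed against.
    """
    preds = []
    recon = frozen_recon(video[0])             # frozen network, frame 1 only
    feat = encoder(video[0])
    preds.append(recon)
    for frame in video[1:]:
        temporal = temporal_net(feat, recon)   # temporal feature of frame j
        feat = encoder(frame)                  # second image feature, frame j+1
        recon = regressor(feat, temporal)      # reconstruction of frame k=j+1
        preds.append(recon)
    return preds

def mean_second_error(preds, annotations):
    # Mean absolute difference between predicted and annotated
    # 3D reconstruction data across all frames of the training video.
    return sum(abs(p - a) for p, a in zip(preds, annotations)) / len(preds)
```

A gradient step on `encoder` and `temporal_net` (with `frozen_recon` held fixed) would then be driven by `mean_second_error`, repeated until the error is minimized or falls below a preset threshold.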
In this way, the above training process ensures that the trained video frame encoding network and temporal feature extraction network can quickly and accurately perform encoding and temporal feature extraction for each frame image in the target video. Because the lightweight neural network has a small number of parameters and a fast operation speed, the trained video frame encoding network can quickly identify the first image feature of the target object in a video frame image, thereby meeting the requirements of real-time, low-latency operation.
In some embodiments, generating the three-dimensional reconstruction result of the second object in the k-th frame training image based on the temporal feature of the (k-1)-th frame training image in the training video and the second image feature of the k-th frame training image includes:
fusing the temporal feature of the (k-1)-th frame training image in the training video and the second image feature of the k-th frame training image to obtain the fused feature of the k-th frame training image;
performing three-dimensional reconstruction on the second object in the k-th frame training image based on the fused feature of the k-th frame training image to obtain the three-dimensional reconstruction result of the second object in the k-th frame training image.
The manner in which, during model training, the three-dimensional reconstruction result of the second object in the k-th frame training image is generated based on the temporal feature of the (k-1)-th frame training image and the second image feature of the k-th frame training image is similar to the detailed implementation of step S14 described above; refer to the related description above, and details are not repeated here to avoid redundancy.
In this way, through feature fusion and simple model regression, a relatively accurate three-dimensional reconstruction result of the second object in the k-th frame training image can be constructed quickly during training.
In some embodiments, adjusting the model parameters of the initial video frame encoding network and the model parameters of the initial temporal feature extraction network according to the second error includes:
adjusting the model parameters of the initial video frame encoding network and the model parameters of the initial temporal feature extraction network with minimizing the second error as the training objective.
That is, in one embodiment, minimizing the error between the model output data and the annotated data may serve as the training objective during model training. In some embodiments, this training objective may be achieved by constructing a corresponding loss function and computing the model parameters of each trained network at which the error is minimized.
In some embodiments, when the second object is a human body image, the three-dimensional reconstruction data includes human body region positions and human body joint point positions, the second image feature includes body shape and posture features, and the second error includes a human body joint projection error.
In a specific implementation, the embodiments of the present disclosure may be applied to the scenario of three-dimensional reconstruction of human body images in videos; that is, the target object in the embodiments of the present disclosure may be a human body image, and the training objects are correspondingly human body images as well. The second image feature may include body shape and posture features, that is, human body image features such as human body shape features and human body posture features; the first image feature in the embodiments of the present disclosure correspondingly also includes body shape and posture features. The three-dimensional reconstruction data may include three-dimensional-reconstruction-related data such as human body region positions and human body joint point positions.
在前述相关网络的训练过程中,所述第二误差则可以包括人体关节投影误差,所述第一误差也可以相应包括人体关节投影误差。也就是说,在对所述三维重建网络、视频帧编码网络、时序特征提取网络的训练过程中,均可以以最小化关节投影误差为训练目标,即在训练过程中可以使训练网络输出的三维人体关节位置与训练图像中标注的人体关节位置之间的差值,也即关节投影误差,不断趋小,直至该误差稳定于一较小误差,以保证训练出的相关网络具备较高的精度。In the training process of the aforementioned networks, the second error may include a human joint projection error, and the first error may correspondingly also include a human joint projection error. That is to say, in the training of the three-dimensional reconstruction network, the video frame encoding network, and the time series feature extraction network, minimizing the joint projection error can serve as the training goal: during training, the difference between the three-dimensional human joint positions output by the training network and the human joint positions annotated in the training images, i.e., the joint projection error, keeps decreasing until it stabilizes at a small value, which ensures that the trained networks have high accuracy.
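The human joint projection error can be illustrated with a hedged sketch; the weak-perspective camera model, the joint coordinates, and the function names below are illustrative assumptions, since the disclosure does not fix a particular camera model:

```python
import numpy as np

# Hedged sketch of a human joint projection error: project the predicted 3D
# joints to 2D (here with an assumed weak-perspective camera) and take the
# mean L2 distance to the annotated 2D joints.

def project_joints(joints_3d, scale, trans):
    """Weak-perspective projection: drop z, then scale and translate."""
    return scale * joints_3d[:, :2] + trans

def joint_projection_error(joints_3d, joints_2d_gt, scale, trans):
    proj = project_joints(joints_3d, scale, trans)
    return float(np.mean(np.linalg.norm(proj - joints_2d_gt, axis=1)))

joints_3d = np.array([[0.0, 0.0, 1.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 2.0, 1.0]])   # toy predicted 3D joints
gt_2d = np.array([[0.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 4.0]])            # toy annotated 2D joints
err = joint_projection_error(joints_3d, gt_2d, scale=2.0, trans=np.zeros(2))
# err == 0.0 for this consistent camera; training drives this error toward 0
```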
这样,可以应用本公开实施例中的视频图像处理方法对人体类视频图像中的人体进行三维重建,生成相应的人体三维动态图像。In this way, the video image processing method in the embodiment of the present disclosure can be used to perform three-dimensional reconstruction of the human body in the human body video image, and generate a corresponding three-dimensional dynamic image of the human body.
进一步的,所述三维重建数据还包括人体三维形体数据,所述第二误差还包括人体三维表面顶点误差。Further, the three-dimensional reconstruction data further includes three-dimensional human body shape data, and the second error further includes a vertex error of the three-dimensional surface of the human body.
即当各训练图像中还标注有三维形体数据时,也即在各训练图像中标注了人体三维表面顶点的位置时,所述第二误差还可以包括人体三维表面顶点误差,所述训练目标相应还可以包括最小化人体三维表面顶点误差,即在训练过程中可以使训练网络输出的三维人体重建结果中的表面三维顶点位置与人工标注的人体表面三维顶点位置之间的差值,也即三维表面顶点误差,不断趋小,直至该误差稳定于一较小误差。That is, when each training image is also annotated with three-dimensional shape data, i.e., when the positions of the three-dimensional surface vertices of the human body are annotated in each training image, the second error may also include a vertex error of the three-dimensional surface of the human body, and the training goal may correspondingly also include minimizing this vertex error: during training, the difference between the surface vertex positions in the three-dimensional human reconstruction result output by the training network and the manually annotated three-dimensional vertex positions of the human body surface, i.e., the three-dimensional surface vertex error, keeps decreasing until it stabilizes at a small value.
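A minimal sketch of the three-dimensional surface vertex error, assuming it is the mean per-vertex Euclidean distance between reconstructed and annotated surface vertices (the exact metric is not specified by the text):

```python
import numpy as np

# Sketch (assumed metric): mean L2 distance between corresponding 3D surface
# vertices of the reconstructed mesh and the annotated mesh.

def vertex_error(pred_vertices, gt_vertices):
    """Mean per-vertex Euclidean distance between two (V, 3) vertex arrays."""
    return float(np.mean(np.linalg.norm(pred_vertices - gt_vertices, axis=1)))

pred = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])   # toy reconstructed vertices
gt   = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 4.0]])   # toy annotated vertices
err = vertex_error(pred, gt)   # per-vertex distances 0 and 3, mean 1.5
```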
下面结合图2,以目标对象为人体图像为例,对本公开实施例中的视频图像处理方法的具体实施方式进行举例说明:The following illustrates, with reference to FIG. 2, a specific implementation of the video image processing method in the embodiments of the present disclosure, taking a human body image as the target object as an example:
首先,可以将视频中的首帧图像即第一帧图像输入至三维人体重建网络,得到首帧图像中的人体三维重建结果,该人体三维重建结果中可以包括人体关节位置、人体区域等信息。该三维人体重建网络可以是一大型骨干卷积神经网络,具有较多的参数量和较大的计算量,可以利用海量的单帧人体图像标注数据训练得到,该三维人体重建网络能够对单帧彩色图像中的人体进行精确地三维重建。First, the first frame image in the video can be input into the three-dimensional human body reconstruction network to obtain the three-dimensional reconstruction result of the human body in the first frame image; this result may include information such as human joint positions and the human body region. The three-dimensional human body reconstruction network can be a large backbone convolutional neural network with a large number of parameters and a large amount of computation, and can be trained on massive annotated single-frame human body images; it is capable of accurately reconstructing the human body in a single color image in three dimensions.
还可以将首帧图像输入至视频帧编码网络,得到首帧图像对应的高层级图像特征,该视频帧编码网络可以是一轻量化的卷积神经网络,具有参数量小、运算速度快的特点,能够满足实时、低延迟运行的要求,所述高层级图像特征可以是卷积神经网络的部分层输出的中间特征,编码了人体形体、姿态等特征信息。The first frame image can also be input into the video frame encoding network to obtain the high-level image features corresponding to the first frame image. The video frame encoding network can be a lightweight convolutional neural network, which has the characteristics of small parameter quantity and fast operation speed. , which can meet the requirements of real-time and low-latency operation, and the high-level image features can be intermediate features output by some layers of the convolutional neural network, which encode feature information such as human body shape and posture.
接着,可以将视频帧编码网络输出的首帧图像的高层级图像特征和三维人体重建网络输出的三维人体重建结果,共同输入至时序特征提取网络,该时序特征提取网络的作用是,综合当前帧的高层级图像特征与三维人体重建结果,对当前帧中的时序特征进行提取,并传递给后续帧中的人体三维模型重建流程。Next, the high-level image feature of the first frame image output by the video frame encoding network and the three-dimensional human body reconstruction result output by the three-dimensional human body reconstruction network can be jointly input into the time series feature extraction network. The function of this network is to combine the high-level image feature of the current frame with the three-dimensional human body reconstruction result, extract the temporal feature of the current frame, and pass it on to the three-dimensional human body model reconstruction process for subsequent frames.
然后,将视频中的第二帧图像输入至视频帧编码网络,以得到相应的高层级图像特征,并将该高层级图像特征与上一步骤中传递而来的上一帧图像中的时序特征进行融合,并通过一简单的卷积神经网络对融合后的特征进行回归,便可得到第二帧图像的三维人体重建结果。然后,可以重复上述综合当前帧的高层级图像特征与三维人体重建结果,对当前帧中的时序特征进行提取的流程。Then, the second frame image in the video is input into the video frame encoding network to obtain the corresponding high-level image feature, which is fused with the temporal feature of the previous frame passed on from the previous step; the fused feature is regressed through a simple convolutional neural network to obtain the three-dimensional human body reconstruction result of the second frame image. The above process of combining the high-level image feature of the current frame with the three-dimensional human body reconstruction result to extract the temporal feature of the current frame can then be repeated.
对于后续帧图像,均可以采用与第二帧图像的三维人体重建流程类似的方式,来得到后续每一帧图像的三维人体重建结果,即利用前一帧传递的时序特征和视频帧编码网络提取的高层级图像特征对当前帧的三维人体模型进行重建,继而生成当前帧的时序特征。For each subsequent frame image, the three-dimensional human body reconstruction result can be obtained in a manner similar to the reconstruction process of the second frame image: the temporal feature passed on from the previous frame and the high-level image feature extracted by the video frame encoding network are used to reconstruct the three-dimensional human body model of the current frame, and the temporal feature of the current frame is then generated.
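The per-frame data flow described above can be sketched schematically. The tiny linear stand-ins below are illustrative assumptions and not the disclosed network architectures; only the flow (heavy reconstruction network on frame 1, then lightweight encoding fused with the previous frame's temporal feature for frames 2..N) mirrors the text:

```python
import numpy as np

# Schematic of the pipeline: frame 1 goes through the heavy reconstruction
# network; every later frame is encoded by the lightweight network, fused with
# the previous frame's temporal feature, and regressed to a reconstruction.
# All functions below are toy stand-ins with illustrative arithmetic.

rng = np.random.default_rng(0)

def heavy_reconstruction(frame):          # stand-in for the 3D reconstruction net
    return frame.mean(axis=0)

def encode_frame(frame):                  # stand-in for the lightweight encoder
    return frame.max(axis=0)

def temporal_feature(image_feat, recon):  # stand-in for the temporal net
    return 0.5 * image_feat + 0.5 * recon

def fuse_and_regress(image_feat, prev_temporal):
    fused = np.concatenate([image_feat, prev_temporal])        # feature fusion
    half = len(image_feat)
    return 0.5 * fused[:half] + 0.5 * fused[half:]             # toy "regression"

frames = [rng.normal(size=(4, 3)) for _ in range(3)]           # toy video, N = 3

recon = heavy_reconstruction(frames[0])                        # frame 1: heavy net
t_feat = temporal_feature(encode_frame(frames[0]), recon)
results = [recon]
for frame in frames[1:]:                                       # frames 2..N: light path
    feat = encode_frame(frame)
    recon = fuse_and_regress(feat, t_feat)
    t_feat = temporal_feature(feat, recon)                     # passed to next frame
    results.append(recon)
```

The point of the design, as the text notes, is that the expensive network runs once while every subsequent frame only pays for the lightweight encoder plus a small fusion/regression step.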
本公开实施例中的视频图像处理方法,将目标视频中的第一帧图像分别输入至三维重建网络和视频帧编码网络,得到所述三维重建网络输出的所述第一帧图像中的目标对象的三维重建结果,以及所述视频帧编码网络输出的所述第一帧图像的第一图像特征,其中,所述第一图像特征为针对所述目标对象的图像特征;将所述目标视频中的第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络,得到所述第i帧图像的时序特征,其中,i的初始值为1;将所述目标视频中的第i+1帧图像输入至所述视频帧编码网络,得到所述第i+1帧图像的第一图像特征;基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果;将i的值更新为i+1,重复执行上述将第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络至基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果的步骤,直至i=N,其中,N为所述目标视频的总帧数。这样,通过使用三维重建网络对视频的第一帧图像中的目标对象进行三维重建,得到较为精准的三维重建结果,并对于视频中的后续每帧图像,通过结合第一帧图像中目标对象的三维重建结果和每帧图像的第一图像特征,便可实现快速地对每帧图像中的目标对象进行精确地三维重建。该方案相比相关技术中的方案,具有计算量小,处理速度快即效率高的优点。In the video image processing method in the embodiment of the present disclosure, the first frame image in the target video is respectively input to a 3D reconstruction network and a video frame encoding network, and the target object in the first frame image output by the 3D reconstruction network is obtained. The three-dimensional reconstruction result of , and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is the image feature for the target object; The first image feature of the ith frame image and the three-dimensional reconstruction result of the target object in the ith frame image are input to the time series feature extraction network to obtain the time series feature of the ith frame image, wherein, the initial value of i is 1; input the i+1 th frame image in the target video into the video frame coding network to obtain the first image feature of the i+1 th frame image; based on the i+1 th frame image an image feature and the time sequence feature of the i-th frame image to generate a three-dimensional reconstruction result of the target object in the i+1-th frame image; update the value of i to i+1, and repeat the above-mentioned ith-frame image The first image feature and the three-dimensional reconstruction result of the target object in the ith frame image 
are input into the time series feature extraction network, up to the step of generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image, until i=N, where N is the total number of frames of the target video. In this way, by using the three-dimensional reconstruction network to perform three-dimensional reconstruction on the target object in the first frame image of the video, a relatively accurate three-dimensional reconstruction result is obtained, and for each subsequent frame image in the video, by combining the three-dimensional reconstruction result of the target object in the first frame image with the first image feature of each frame image, fast and accurate three-dimensional reconstruction of the target object in each frame image can be achieved. Compared with the solutions in the related art, this solution has the advantages of a small amount of computation, high processing speed, and high efficiency.
图3是根据一示例性实施例示出的一种视频图像处理装置框图。参照图3,该视频图像处理装置包括第一处理模块301、第二处理模块302、第三处理模块303、三维重建模块304和执行模块305。Fig. 3 is a block diagram of a video image processing apparatus according to an exemplary embodiment. 3 , the video image processing apparatus includes a first processing module 301 , a second processing module 302 , a third processing module 303 , a three-dimensional reconstruction module 304 and an execution module 305 .
该第一处理模块301被配置为执行将目标视频中的第一帧图像分别输入至三维重建网络和视频帧编码网络,得到所述三维重建网络输出的所述第一帧图像中的目标对象的三维重建结果,以及所述视频帧编码网络输出的所述第一帧图像的第一图像特征,其中,所述第一图像特征为针对所述目标对象的图像特征;The first processing module 301 is configured to input the first frame image in the target video into the three-dimensional reconstruction network and the video frame encoding network respectively, to obtain the three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object;
该第二处理模块302被配置为执行将所述目标视频中的第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络,得到所述第i帧图像的时序特征,其中,i的初始值为1,i为正整数;The second processing module 302 is configured to input the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1, and i is a positive integer;
该第三处理模块303被配置为执行将所述目标视频中的第i+1帧图像输入至所述视频帧编码网络,得到所述第i+1帧图像的第一图像特征;The third processing module 303 is configured to input the i+1 th frame image in the target video into the video frame coding network to obtain the first image feature of the i+1 th frame image;
该三维重建模块304被配置为执行基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果;The three-dimensional reconstruction module 304 is configured to generate the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image;
该执行模块305被配置为执行将i的值更新为i+1,重复执行上述将第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络至基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果的步骤,直至i=N,其中,N为所述目标视频的总帧数。The execution module 305 is configured to update the value of i to i+1, and to repeat the steps from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network, to generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image, until i=N, where N is the total number of frames of the target video.
在一些实施例中,所述三维重建网络的结构参数数量大于所述视频帧编码网络的结构参数数量。In some embodiments, the number of structural parameters of the three-dimensional reconstruction network is greater than the number of structural parameters of the video frame encoding network.
在一些实施例中,三维重建模块304包括:In some embodiments, the three-dimensional reconstruction module 304 includes:
融合单元,被配置为执行对所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征进行融合,得到所述第i+1帧图像的融合特征;a fusion unit, configured to perform fusion of the first image feature of the i+1 th frame image and the time sequence feature of the i th frame image to obtain the fusion feature of the i+1 th frame image;
三维重建单元,被配置为执行基于所述第i+1帧图像的融合特征对所述第i+1帧图像中的目标对象进行三维重建,得到所述第i+1帧图像中的目标对象的三维重建结果。A three-dimensional reconstruction unit, configured to perform three-dimensional reconstruction of the target object in the (i+1)-th frame image based on the fusion feature of the (i+1)-th frame image, to obtain the three-dimensional reconstruction result of the target object in the (i+1)-th frame image.
在一些实施例中,所述三维重建网络的训练过程包括:In some embodiments, the training process of the 3D reconstruction network includes:
获取标注有第一对象的三维重建数据的训练图像集,其中,所述第一对象的类型与所述目标对象的类型相同;acquiring a training image set marked with three-dimensional reconstruction data of a first object, wherein the type of the first object is the same as the type of the target object;
将所述训练图像集中的训练图像输入初始三维重建网络,得到各训练图像的三维重建数据;inputting the training images in the training image set into the initial three-dimensional reconstruction network to obtain three-dimensional reconstruction data of each training image;
计算所述各训练图像的三维重建数据与标注的各训练图像的三维重建数据之间的第一误差;calculating the first error between the three-dimensional reconstruction data of each training image and the marked three-dimensional reconstruction data of each training image;
基于所述第一误差,对所述初始三维重建网络的模型参数进行调整,得到训练好的所述三维重建网络。Based on the first error, the model parameters of the initial 3D reconstruction network are adjusted to obtain the trained 3D reconstruction network.
在一些实施例中,所述视频帧编码网络和所述时序特征提取网络的训练过程包括:In some embodiments, the training process of the video frame encoding network and the time series feature extraction network includes:
获取标注有第二对象的三维重建数据的训练视频集,其中,所述第二对象的类型与所述目标对象的类型相同;Obtaining a training video set marked with three-dimensional reconstruction data of a second object, wherein the type of the second object is the same as the type of the target object;
将所述训练视频集中的训练视频中的第一帧训练图像输入至训练好的所述三维重建网络,得到所述第一帧训练图像中的第二对象的三维重建结果;inputting the first frame of training images in the training videos in the training video set into the trained three-dimensional reconstruction network to obtain a three-dimensional reconstruction result of the second object in the first frame of training images;
将所述训练视频中的每帧训练图像分别输入至初始视频帧编码网络,得到所述每帧训练图像的第二图像特征,所述第二图像特征为针对所述第二对象的图像特征;Inputting each frame of training image in the training video to the initial video frame encoding network respectively, to obtain the second image feature of the training image of each frame, and the second image feature is the image feature for the second object;
将所述训练视频中的第j帧训练图像中的第二对象的三维重建结果和所述第j帧训练图像的第二图像特征输入至初始时序特征提取网络,得到所述第j帧训练图像的时序特征,其中,j为1至M之间的任意整数,M为所述训练视频的总帧数;The three-dimensional reconstruction result of the second object in the j-th frame training image in the training video and the second image feature of the j-th frame training image are input into the initial time series feature extraction network to obtain the time series feature of the j-th frame training image, where j is any integer between 1 and M, and M is the total number of frames of the training video;
基于所述训练视频中的第k-1帧训练图像的时序特征和第k帧训练图像的第二图像特征,生成所述第k帧训练图像中的第二对象的三维重建结果,其中,k为2至M之间的任意整数;Based on the time series feature of the k-1th frame of training image in the training video and the second image feature of the kth frame of training image, a 3D reconstruction result of the second object in the kth frame of training image is generated, where k is any integer between 2 and M;
计算所述每帧训练图像的时序特征中对应的三维重建数据与标注的所述每帧训练图像的三维重建数据之间的第二误差;Calculate the second error between the corresponding three-dimensional reconstruction data in the time series feature of each frame of training image and the marked three-dimensional reconstruction data of each frame of training image;
根据所述第二误差,对所述初始视频帧编码网络的模型参数和所述初始时序特征提取网络的模型参数进行调整,得到训练好的所述视频帧编码网络和所述时序特征提取网络。According to the second error, the model parameters of the initial video frame encoding network and the model parameters of the initial time series feature extraction network are adjusted to obtain the trained video frame encoding network and the trained time series feature extraction network.
在一些实施例中,所述基于所述训练视频中的第k-1帧训练图像的时序特征和第k帧训练图像的第二图像特征,生成所述第k帧训练图像中的第二对象的三维重建结果,包括:In some embodiments, the generating the three-dimensional reconstruction result of the second object in the k-th frame training image based on the time series feature of the (k-1)-th frame training image in the training video and the second image feature of the k-th frame training image includes:
对所述训练视频中的第k-1帧训练图像的时序特征和第k帧训练图像的第二图像特征进行融合,得到所述第k帧训练图像的融合特征;The time sequence feature of the k-1th frame training image in the training video and the second image feature of the kth frame training image are fused to obtain the fusion feature of the kth frame training image;
基于所述第k帧训练图像的融合特征,对所述第k帧训练图像中的第二对象进行三维重建,得到所述第k帧训练图像中的第二对象的三维重建结果。Based on the fusion feature of the kth frame of training image, three-dimensional reconstruction is performed on the second object in the kth frame of training image, and a three-dimensional reconstruction result of the second object in the kth frame of training image is obtained.
在一些实施例中,所述第二对象为人体图像时,所述三维重建数据包括人体区域位置和人体关节点位置,所述第二图像特征包括形体姿态特征,所述第二误差包括人体关节投影误差。In some embodiments, when the second object is a human body image, the three-dimensional reconstruction data includes human body region positions and human joint point positions, the second image feature includes a body shape and posture feature, and the second error includes a human joint projection error.
在一些实施例中,所述三维重建数据还包括人体三维形体数据,所述第二误差还包括人体三维表面顶点误差。In some embodiments, the three-dimensional reconstruction data further includes three-dimensional human body shape data, and the second error further includes a vertex error of the three-dimensional surface of the human body.
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。Regarding the apparatus in the above-mentioned embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment of the method, and will not be described in detail here.
本公开实施例中的视频图像处理装置300,将目标视频中的第一帧图像分别输入至三维重建网络和视频帧编码网络,得到所述三维重建网络输出的所述第一帧图像中的目标对象的三维重建结果,以及所述视频帧编码网络输出的所述第一帧图像的第一图像特征,其中,所述第一图像特征为针对所述目标对象的图像特征;将所述目标视频中的第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络,得到所述第i帧图像的时序特征,其中,i的初始值为1;将所述目标视频中的第i+1帧图像输入至所述视频帧编码网络,得到所述第i+1帧图像的第一图像特征;基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果;将i的值更新为i+1,重复执行上述将第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络至基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果的步骤,直至i=N,其中,N为所述目标视频的总帧数。这样,通过使用三维重建网络对视频的第一帧图像中的目标对象进行三维重建,得到较为精准的三维重建结果,并对于视频中的后续每帧图像,通过结合第一帧图像中目标对象的三维重建结果和每帧图像的第一图像特征,便可实现快速地对每帧图像中的目标对象进行精确地三维重建。该方案相比相关技术中的方案,具有计算量小,处理速度快即效率高的优点。The video image processing apparatus 300 in this embodiment of the present disclosure inputs the first frame image in the target video into a three-dimensional reconstruction network and a video frame encoding network respectively, to obtain the three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and the first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object; inputs the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network to obtain the time series feature of the i-th frame image, wherein the initial value of i is 1; inputs the (i+1)-th frame image in the target video into the video frame encoding network to obtain the first image feature of the (i+1)-th frame image; generates the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image; and updates the value of i to i+1 and repeats the steps from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network, to generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image, until i=N, where N is the total number of frames of the target video. In this way, by using the three-dimensional reconstruction network to perform three-dimensional reconstruction on the target object in the first frame image of the video, a relatively accurate three-dimensional reconstruction result is obtained, and for each subsequent frame image in the video, by combining the three-dimensional reconstruction result of the target object in the first frame image with the first image feature of each frame image, fast and accurate three-dimensional reconstruction of the target object in each frame image can be achieved. Compared with the solutions in the related art, this solution has the advantages of a small amount of computation, high processing speed, and high efficiency.
图4是根据一示例性实施例示出的一种用于电子设备400的框图。FIG. 4 is a block diagram of an electronic device 400 according to an exemplary embodiment.
在示例性实施例中,还提供了一种包括指令的计算机可读存储介质,例如包括指令的存储器410,上述指令可由电子设备400的处理器420执行以完成上述视频图像处理方法。在一些实施例中,计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。In an exemplary embodiment, a computer-readable storage medium including instructions, such as a memory 410 including instructions, is also provided, and the instructions can be executed by the processor 420 of the electronic device 400 to complete the above video image processing method. In some embodiments, the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
在图4中,总线架构可以包括任意数量的互联的总线和桥,具体由处理器420代表的一个或多个处理器和存储器410代表的存储器的各种电路链接在一起。总线架构还可以将诸如***设备、稳压器和功率管理电路等之类的各种其他电路链接在一起,这些都是本领域所公知的,因此,本文不再对其进行进一步描述。总线接口430提供接口。In FIG. 4, the bus architecture may include any number of interconnected buses and bridges, in particular one or more processors, represented by processor 420, and various circuits of memory, represented by memory 410, linked together. The bus architecture may also link together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and, therefore, will not be described further herein. The bus interface 430 provides the interface.
处理器420负责管理总线架构和通常的处理,存储器410可以存储处理器420在执行操作时所使用的数据。The processor 420 is responsible for managing the bus architecture and general processing, and the memory 410 may store data used by the processor 420 in performing operations.
在示例性实施例中,还提供一种计算机程序产品,包括计算机程序,所述计算机程序被处理器执行时实现上述视频图像处理方法。In an exemplary embodiment, a computer program product is also provided, including a computer program, which implements the above-mentioned video image processing method when executed by a processor.
本公开所有实施例均可以单独被执行,也可以与其他实施例相结合被执行,均视为本公开要求的保护范围。All the embodiments of the present disclosure can be implemented independently or in combination with other embodiments, which are all regarded as the protection scope required by the present disclosure.

Claims (19)

  1. 一种视频图像处理方法,其中,所述方法包括:A video image processing method, wherein the method comprises:
    将目标视频中的第一帧图像分别输入至三维重建网络和视频帧编码网络,得到所述三维重建网络输出的所述第一帧图像中的目标对象的三维重建结果,以及所述视频帧编码网络输出的所述第一帧图像的第一图像特征,其中,所述第一图像特征为针对所述目标对象的图像特征;Inputting the first frame image in the target video into a three-dimensional reconstruction network and a video frame encoding network respectively, to obtain a three-dimensional reconstruction result of the target object in the first frame image output by the three-dimensional reconstruction network, and a first image feature of the first frame image output by the video frame encoding network, wherein the first image feature is an image feature for the target object;
    将所述目标视频中的第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络,得到所述第i帧图像的时序特征,其中,i的初始值为1;Inputting the first image feature of the i-th frame image in the target video and the three-dimensional reconstruction result of the target object in the i-th frame image to the time series feature extraction network to obtain the time series feature of the i-th frame image, wherein , the initial value of i is 1;
    将所述目标视频中的第i+1帧图像输入至所述视频帧编码网络,得到所述第i+1帧图像的第一图像特征;Inputting the i+1th frame image in the target video to the video frame encoding network to obtain the first image feature of the i+1th frame image;
    基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果;generating a three-dimensional reconstruction result of the target object in the i+1 th frame image based on the first image feature of the i+1 th frame image and the time sequence feature of the i th frame image;
    将i的值更新为i+1,重复执行上述将第i帧图像的第一图像特征和所述第i帧图像中的目标对象的三维重建结果输入至时序特征提取网络至基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果的步骤,直至i=N,其中,N为所述目标视频的总帧数。The value of i is updated to i+1, and the steps from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the time series feature extraction network, to generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image, are repeated until i=N, where N is the total number of frames of the target video.
  2. 根据权利要求1所述的方法,其中,所述三维重建网络的结构参数数量大于所述视频帧编码网络的结构参数数量。The method of claim 1, wherein the number of structural parameters of the three-dimensional reconstruction network is greater than the number of structural parameters of the video frame encoding network.
  3. 根据权利要求1所述的方法,其中,所述基于所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征,生成所述第i+1帧图像中的目标对象的三维重建结果,包括:The method according to claim 1, wherein the generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the time series feature of the i-th frame image includes:
    对所述第i+1帧图像的第一图像特征和所述第i帧图像的时序特征进行融合,得到所述第i+1帧图像的融合特征;Fusing the first image feature of the i+1 th frame image and the time sequence feature of the i th frame image to obtain the fusion feature of the i+1 th frame image;
    基于所述第i+1帧图像的融合特征对所述第i+1帧图像中的目标对象进行三维重建,得到所述第i+1帧图像中的目标对象的三维重建结果。Three-dimensional reconstruction is performed on the target object in the i+1 th frame image based on the fusion feature of the i+1 th frame image, and a 3D reconstruction result of the target object in the i+1 th frame image is obtained.
  4. 根据权利要求1所述的方法,其中,所述三维重建网络的训练过程包括:The method according to claim 1, wherein the training process of the 3D reconstruction network comprises:
    获取标注有第一对象的三维重建数据的训练图像集,其中,所述第一对象的类型与所述目标对象的类型相同;acquiring a training image set marked with three-dimensional reconstruction data of a first object, wherein the type of the first object is the same as the type of the target object;
    将所述训练图像集中的训练图像输入初始三维重建网络,得到各训练图像的三维重建数据;inputting the training images in the training image set into the initial three-dimensional reconstruction network to obtain three-dimensional reconstruction data of each training image;
    计算所述各训练图像的三维重建数据与标注的各训练图像的三维重建数据之间的第一误差;calculating the first error between the three-dimensional reconstruction data of each training image and the marked three-dimensional reconstruction data of each training image;
    基于所述第一误差,对所述初始三维重建网络的模型参数进行调整,得到训练好的所述三维重建网络。Based on the first error, the model parameters of the initial 3D reconstruction network are adjusted to obtain the trained 3D reconstruction network.
  5. 根据权利要求4所述的方法,其中,所述视频帧编码网络和所述时序特征提取网络的训练过程包括:The method according to claim 4, wherein the training process of the video frame encoding network and the time series feature extraction network comprises:
    获取标注有第二对象的三维重建数据的训练视频集,其中,所述第二对象的类型与所述目标对象的类型相同;Obtaining a training video set marked with three-dimensional reconstruction data of a second object, wherein the type of the second object is the same as the type of the target object;
    将所述训练视频集中的训练视频中的第一帧训练图像输入至训练好的所述三维重建网络,得到所述第一帧训练图像中的第二对象的三维重建结果;inputting the first frame of training images in the training videos in the training video set into the trained three-dimensional reconstruction network to obtain a three-dimensional reconstruction result of the second object in the first frame of training images;
    将所述训练视频中的每帧训练图像分别输入至初始视频帧编码网络,得到所述每帧训练图像的第二图像特征,所述第二图像特征为针对所述第二对象的图像特征;Inputting each frame of training image in the training video to the initial video frame encoding network respectively, to obtain the second image feature of the training image of each frame, and the second image feature is the image feature for the second object;
    将所述训练视频中的第j帧训练图像中的第二对象的三维重建结果和所述第j帧训练图像的第二图像特征输入至初始时序特征提取网络,得到所述第j帧训练图像的时序特征,其中,j为1至M之间的任意整数,M为所述训练视频的总帧数;The three-dimensional reconstruction result of the second object in the j-th frame training image in the training video and the second image feature of the j-th frame training image are input into the initial time series feature extraction network to obtain the time series feature of the j-th frame training image, where j is any integer between 1 and M, and M is the total number of frames of the training video;
    基于所述训练视频中的第k-1帧训练图像的时序特征和第k帧训练图像的第二图像特征,生成所述第k帧训练图像中的第二对象的三维重建结果,其中,k为2至M之间的任意整数;Based on the time series feature of the k-1th frame of training image in the training video and the second image feature of the kth frame of training image, a 3D reconstruction result of the second object in the kth frame of training image is generated, where k is any integer between 2 and M;
    计算所述每帧训练图像的时序特征中对应的三维重建数据与标注的所述每帧训练图像的三维重建数据之间的第二误差;Calculate the second error between the corresponding three-dimensional reconstruction data in the time series feature of each frame of training image and the marked three-dimensional reconstruction data of each frame of training image;
    根据所述第二误差,对所述初始视频帧编码网络的模型参数和所述初始时序特征提取网络的模型参数进行调整,得到训练好的所述视频帧编码网络和所述时序特征提取网络。According to the second error, the model parameters of the initial video frame encoding network and the model parameters of the initial time series feature extraction network are adjusted to obtain the trained video frame encoding network and the trained time series feature extraction network.
  6. 根据权利要求5所述的方法,其中,所述基于所述训练视频中的第k-1帧训练图像的时序特征和第k帧训练图像的第二图像特征,生成所述第k帧训练图像中的第二对象的三维重建结果,包括:The method according to claim 5, wherein the generating the three-dimensional reconstruction result of the second object in the k-th frame training image based on the time series feature of the (k-1)-th frame training image in the training video and the second image feature of the k-th frame training image includes:
    对所述训练视频中的第k-1帧训练图像的时序特征和第k帧训练图像的第二图像特征进行融合,得到所述第k帧训练图像的融合特征;The time sequence feature of the k-1th frame training image in the training video and the second image feature of the kth frame training image are fused to obtain the fusion feature of the kth frame training image;
    基于所述第k帧训练图像的融合特征,对所述第k帧训练图像中的第二对象进行三维重建,得到所述第k帧训练图像中的第二对象的三维重建结果。Based on the fusion feature of the kth frame of training image, three-dimensional reconstruction is performed on the second object in the kth frame of training image, and a three-dimensional reconstruction result of the second object in the kth frame of training image is obtained.
  7. The method according to claim 5, wherein, in a case that the second object is a human body image, the three-dimensional reconstruction data comprises a human body region position and human body joint point positions, the second image feature comprises a body posture feature, and the second error comprises a human body joint projection error.
  8. The method according to claim 7, wherein the three-dimensional reconstruction data further comprises three-dimensional human body shape data, and the second error further comprises a three-dimensional human body surface vertex error.
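For the human joint projection error above, one common formulation (an assumption; the claims do not fix the camera model or the distance metric) projects the reconstructed 3D joints to the image plane with a weak-perspective camera and averages the distance to the annotated 2D joints:

```python
import numpy as np

def project_joints(joints_3d, scale, trans):
    # Weak-perspective projection (assumed camera model): drop the depth
    # coordinate, then apply a per-frame scale and 2D translation.
    return scale * joints_3d[:, :2] + trans

def joint_projection_error(joints_3d, joints_2d_gt, scale, trans):
    # Mean Euclidean distance between projected and annotated 2D joints.
    proj = project_joints(joints_3d, scale, trans)
    return float(np.mean(np.linalg.norm(proj - joints_2d_gt, axis=1)))

joints_3d = np.zeros((24, 3))       # 24 joints (SMPL-style count, assumed)
joints_2d_gt = np.ones((24, 2))     # annotated 2D joint positions
err = joint_projection_error(joints_3d, joints_2d_gt,
                             scale=1.0, trans=np.zeros(2))
print(err)  # each projected joint is at the origin, distance to (1, 1) is sqrt(2)
```

The surface vertex error of claim 8 would be computed the same way, only over mesh vertices instead of joints.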
  9. A video image processing apparatus, wherein the apparatus comprises:
    a first processing module, configured to input a first frame image of a target video into a three-dimensional reconstruction network and a video frame encoding network respectively, to obtain a three-dimensional reconstruction result, output by the three-dimensional reconstruction network, of a target object in the first frame image, and a first image feature, output by the video frame encoding network, of the first frame image, wherein the first image feature is an image feature of the target object;
    a second processing module, configured to input the first image feature of an i-th frame image of the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into a temporal feature extraction network, to obtain a temporal feature of the i-th frame image, wherein the initial value of i is 1;
    a third processing module, configured to input an (i+1)-th frame image of the target video into the video frame encoding network, to obtain a first image feature of the (i+1)-th frame image;
    a three-dimensional reconstruction module, configured to generate a three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image; and
    an execution module, configured to update the value of i to i+1 and repeat the steps from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the temporal feature extraction network through generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image, until i = N, wherein N is the total number of frames of the target video.
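The frame-by-frame flow that these modules recite (heavy single-image reconstruction for the first frame only, then lightweight per-frame encoding combined with a propagated temporal feature) can be sketched as follows; the stub lambdas stand in for the trained networks, and their shapes are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stubs for the four networks; each would be a trained model in practice.
encode = lambda frame: frame.mean(axis=(0, 1))             # video frame encoding network
reconstruct_single = lambda feat: feat[:3]                 # 3D reconstruction network (frame 1 only)
extract_temporal = lambda feat, recon: np.concatenate([feat, recon])  # temporal feature extraction
reconstruct_fused = lambda feat, temporal: feat[:3] + 0.1 * temporal[:3]

frames = [rng.random((4, 4, 3)) for _ in range(5)]  # target video, N = 5 frames

feat = encode(frames[0])
recon = reconstruct_single(feat)       # only the first frame uses the heavy network
results = [recon]
for i in range(1, len(frames)):
    temporal = extract_temporal(feat, recon)  # temporal feature of frame i
    feat = encode(frames[i])                  # first image feature of frame i+1
    recon = reconstruct_fused(feat, temporal) # reconstruction result of frame i+1
    results.append(recon)

print(len(results))  # 5: one reconstruction result per frame
```

The design choice the claims describe is visible here: after the first frame, each step needs only the cheap encoder plus the temporal feature, which is what claim 10's parameter-count comparison makes worthwhile.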
  10. The video image processing apparatus according to claim 9, wherein the number of structural parameters of the three-dimensional reconstruction network is greater than the number of structural parameters of the video frame encoding network.
  11. The video image processing apparatus according to claim 9, wherein the three-dimensional reconstruction module comprises:
    a fusion unit, configured to fuse the first image feature of the (i+1)-th frame image with the temporal feature of the i-th frame image, to obtain a fused feature of the (i+1)-th frame image; and
    a three-dimensional reconstruction unit, configured to perform three-dimensional reconstruction on the target object in the (i+1)-th frame image based on the fused feature of the (i+1)-th frame image, to obtain the three-dimensional reconstruction result of the target object in the (i+1)-th frame image.
  12. The video image processing apparatus according to claim 9, wherein the training process of the three-dimensional reconstruction network comprises:
    acquiring a training image set annotated with three-dimensional reconstruction data of a first object, wherein the type of the first object is the same as the type of the target object;
    inputting the training images in the training image set into an initial three-dimensional reconstruction network, to obtain three-dimensional reconstruction data of each training image;
    calculating a first error between the three-dimensional reconstruction data of each training image and the annotated three-dimensional reconstruction data of each training image;
    adjusting the model parameters of the initial three-dimensional reconstruction network based on the first error, to obtain the trained three-dimensional reconstruction network.
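The first-error training loop described here can be sketched with a toy stand-in for the network; the single linear layer, the squared-error form of the first error, and plain gradient descent are all assumptions made for illustration, since the claims fix none of them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the initial 3D reconstruction network: one linear layer
# mapping image features (32-dim, assumed) to reconstruction data (10-dim, assumed).
w = np.zeros((10, 32))
features = rng.standard_normal((100, 32))  # training image set, as features
targets = rng.standard_normal((100, 10))   # annotated 3D reconstruction data

def first_error(w):
    # "First error": mean squared error between predicted and annotated data.
    return float(np.mean((features @ w.T - targets) ** 2))

initial = first_error(w)
lr = 0.05
for _ in range(200):
    grad = 2 * (features @ w.T - targets).T @ features / len(features)
    w -= lr * grad                         # adjust the model parameters

final = first_error(w)
print(final < initial)  # the error decreases as the parameters are adjusted
```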
  13. The video image processing apparatus according to claim 12, wherein the training process of the video frame encoding network and the temporal feature extraction network comprises:
    acquiring a training video set annotated with three-dimensional reconstruction data of a second object, wherein the type of the second object is the same as the type of the target object;
    inputting the first frame of training image of a training video in the training video set into the trained three-dimensional reconstruction network, to obtain a three-dimensional reconstruction result of the second object in the first frame of training image;
    inputting each frame of training image of the training video into an initial video frame encoding network respectively, to obtain a second image feature of each frame of training image, wherein the second image feature is an image feature of the second object;
    inputting the three-dimensional reconstruction result of the second object in a j-th frame of training image of the training video and the second image feature of the j-th frame of training image into an initial temporal feature extraction network, to obtain a temporal feature of the j-th frame of training image, wherein j is any integer from 1 to M, and M is the total number of frames of the training video;
    generating, based on the temporal feature of the (k-1)-th frame of training image in the training video and the second image feature of the k-th frame of training image, a three-dimensional reconstruction result of the second object in the k-th frame of training image, wherein k is any integer from 2 to M;
    calculating a second error between the three-dimensional reconstruction data corresponding to the temporal feature of each frame of training image and the annotated three-dimensional reconstruction data of that frame of training image;
    adjusting, according to the second error, the model parameters of the initial video frame encoding network and the model parameters of the initial temporal feature extraction network, to obtain the trained video frame encoding network and the trained temporal feature extraction network.
  14. The video image processing apparatus according to claim 9, wherein generating the three-dimensional reconstruction result of the second object in the k-th frame of training image based on the temporal feature of the (k-1)-th frame of training image in the training video and the second image feature of the k-th frame of training image comprises:
    fusing the temporal feature of the (k-1)-th frame of training image in the training video with the second image feature of the k-th frame of training image, to obtain a fused feature of the k-th frame of training image;
    performing, based on the fused feature of the k-th frame of training image, three-dimensional reconstruction on the second object in the k-th frame of training image, to obtain the three-dimensional reconstruction result of the second object in the k-th frame of training image.
  15. The video image processing apparatus according to claim 13, wherein, in a case that the second object is a human body image, the three-dimensional reconstruction data comprises a human body region position and human body joint point positions, the second image feature comprises a body posture feature, and the second error comprises a human body joint projection error.
  16. The video image processing apparatus according to claim 8, wherein the three-dimensional reconstruction data further comprises three-dimensional human body shape data, and the second error further comprises a three-dimensional human body surface vertex error.
  17. An electronic device, wherein the electronic device comprises:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement a video image processing method, the method comprising:
    inputting a first frame image of a target video into a three-dimensional reconstruction network and a video frame encoding network respectively, to obtain a three-dimensional reconstruction result, output by the three-dimensional reconstruction network, of a target object in the first frame image, and a first image feature, output by the video frame encoding network, of the first frame image, wherein the first image feature is an image feature of the target object;
    inputting the first image feature of an i-th frame image of the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into a temporal feature extraction network, to obtain a temporal feature of the i-th frame image, wherein the initial value of i is 1;
    inputting an (i+1)-th frame image of the target video into the video frame encoding network, to obtain a first image feature of the (i+1)-th frame image;
    generating a three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image; and
    updating the value of i to i+1 and repeating the steps from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the temporal feature extraction network through generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image, until i = N, wherein N is the total number of frames of the target video.
  18. A non-volatile computer-readable storage medium, wherein, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a video image processing method, the method comprising:
    inputting a first frame image of a target video into a three-dimensional reconstruction network and a video frame encoding network respectively, to obtain a three-dimensional reconstruction result, output by the three-dimensional reconstruction network, of a target object in the first frame image, and a first image feature, output by the video frame encoding network, of the first frame image, wherein the first image feature is an image feature of the target object;
    inputting the first image feature of an i-th frame image of the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into a temporal feature extraction network, to obtain a temporal feature of the i-th frame image, wherein the initial value of i is 1;
    inputting an (i+1)-th frame image of the target video into the video frame encoding network, to obtain a first image feature of the (i+1)-th frame image;
    generating a three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image; and
    updating the value of i to i+1 and repeating the steps from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the temporal feature extraction network through generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image, until i = N, wherein N is the total number of frames of the target video.
  19. A computer program product, comprising a computer program, wherein, when the computer program is executed by a processor, a video image processing method is implemented, the method comprising:
    inputting a first frame image of a target video into a three-dimensional reconstruction network and a video frame encoding network respectively, to obtain a three-dimensional reconstruction result, output by the three-dimensional reconstruction network, of a target object in the first frame image, and a first image feature, output by the video frame encoding network, of the first frame image, wherein the first image feature is an image feature of the target object;
    inputting the first image feature of an i-th frame image of the target video and the three-dimensional reconstruction result of the target object in the i-th frame image into a temporal feature extraction network, to obtain a temporal feature of the i-th frame image, wherein the initial value of i is 1;
    inputting an (i+1)-th frame image of the target video into the video frame encoding network, to obtain a first image feature of the (i+1)-th frame image;
    generating a three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image; and
    updating the value of i to i+1 and repeating the steps from inputting the first image feature of the i-th frame image and the three-dimensional reconstruction result of the target object in the i-th frame image into the temporal feature extraction network through generating the three-dimensional reconstruction result of the target object in the (i+1)-th frame image based on the first image feature of the (i+1)-th frame image and the temporal feature of the i-th frame image, until i = N, wherein N is the total number of frames of the target video.
PCT/CN2021/127942 2020-12-31 2021-11-01 Video image processing method and apparatus WO2022142702A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011625995.2 2020-12-31
CN202011625995.2A CN112767534B (en) 2020-12-31 2020-12-31 Video image processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022142702A1 true WO2022142702A1 (en) 2022-07-07

Family

ID=75699076

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127942 WO2022142702A1 (en) 2020-12-31 2021-11-01 Video image processing method and apparatus

Country Status (2)

Country Link
CN (1) CN112767534B (en)
WO (1) WO2022142702A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767534B (en) * 2020-12-31 2024-02-09 北京达佳互联信息技术有限公司 Video image processing method, device, electronic equipment and storage medium
CN113963175A (en) * 2021-05-13 2022-01-21 北京市商汤科技开发有限公司 Image processing method and device, computer equipment and storage medium
CN114399718B (en) * 2022-03-21 2022-08-16 北京网之晴科技有限公司 Image content identification method and device in video playing process
WO2023206420A1 (en) * 2022-04-29 2023-11-02 Oppo广东移动通信有限公司 Video encoding and decoding method and apparatus, device, system and storage medium
CN115457432B (en) * 2022-08-25 2023-10-27 埃洛克航空科技(北京)有限公司 Data processing method and device for video frame extraction


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766839B (en) * 2017-11-09 2020-01-14 清华大学 Motion recognition method and device based on 3D convolutional neural network
CN108122281B (en) * 2017-12-22 2021-08-24 洛阳中科众创空间科技有限公司 Large-range real-time human body three-dimensional reconstruction method
CN109410242B (en) * 2018-09-05 2020-09-22 华南理工大学 Target tracking method, system, equipment and medium based on double-current convolutional neural network
CN109271933B (en) * 2018-09-17 2021-11-16 北京航空航天大学青岛研究院 Method for estimating three-dimensional human body posture based on video stream
WO2020113423A1 (en) * 2018-12-04 2020-06-11 深圳市大疆创新科技有限公司 Target scene three-dimensional reconstruction method and system, and unmanned aerial vehicle
CN109712234B (en) * 2018-12-29 2023-04-07 北京卡路里信息技术有限公司 Three-dimensional human body model generation method, device, equipment and storage medium
CN110874864B (en) * 2019-10-25 2022-01-14 奥比中光科技集团股份有限公司 Method, device, electronic equipment and system for obtaining three-dimensional model of object
CN111311732B (en) * 2020-04-26 2023-06-20 中国人民解放军国防科技大学 3D human body grid acquisition method and device
CN111862275B (en) * 2020-07-24 2023-06-06 厦门真景科技有限公司 Video editing method, device and equipment based on 3D reconstruction technology
CN111738220B (en) * 2020-07-27 2023-09-15 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190333269A1 (en) * 2017-01-19 2019-10-31 Panasonic Intellectual Property Corporation Of America Three-dimensional reconstruction method, three-dimensional reconstruction apparatus, and generation method for generating three-dimensional model
CN110738211A (en) * 2019-10-17 2020-01-31 腾讯科技(深圳)有限公司 object detection method, related device and equipment
CN111598998A (en) * 2020-05-13 2020-08-28 腾讯科技(深圳)有限公司 Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium
CN112767534A (en) * 2020-12-31 2021-05-07 北京达佳互联信息技术有限公司 Video image processing method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596927A (en) * 2023-07-17 2023-08-15 浙江核睿医疗科技有限公司 Endoscope video processing method, system and device
CN116596927B (en) * 2023-07-17 2023-09-26 浙江核睿医疗科技有限公司 Endoscope video processing method, system and device

Also Published As

Publication number Publication date
CN112767534B (en) 2024-02-09
CN112767534A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
WO2022142702A1 (en) Video image processing method and apparatus
Chen et al. Talking-head generation with rhythmic head motion
Zhang et al. Interacting two-hand 3d pose and shape reconstruction from single color image
Qian et al. Speech drives templates: Co-speech gesture synthesis with learned templates
Zhao et al. Masked GAN for unsupervised depth and pose prediction with scale consistency
Liu et al. Human pose estimation in video via structured space learning and halfway temporal evaluation
Kim et al. Recurrent temporal aggregation framework for deep video inpainting
WO2021063271A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
Ye et al. Audio-driven talking face video generation with dynamic convolution kernels
US9129434B2 (en) Method and system for 3D surface deformation fitting
CN110610486A (en) Monocular image depth estimation method and device
JP2017531242A (en) Method and device for editing face image
Huang et al. Object-occluded human shape and pose estimation with probabilistic latent consistency
Ren et al. Hr-net: a landmark based high realistic face reenactment network
Wei et al. Learning to infer semantic parameters for 3D shape editing
Liu et al. Event-based monocular dense depth estimation with recurrent transformers
Li et al. Skeleton2humanoid: Animating simulated characters for physically-plausible motion in-betweening
WO2020193972A1 (en) Facial analysis
Liu et al. Geomim: Towards better 3d knowledge transfer via masked image modeling for multi-view 3d understanding
Gao et al. Edge Devices Friendly Self-Supervised Monocular Depth Estimation Via Knowledge Distillation
Liu et al. Event-based monocular depth estimation with recurrent transformers
Song et al. Sparse rig parameter optimization for character animation
Sun et al. SSAT $++ $: A Semantic-Aware and Versatile Makeup Transfer Network With Local Color Consistency Constraint
Bin et al. Fsa-net: a cost-efficient face swapping attention network with occlusion-aware normalization
Zhou et al. MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913446

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.10.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21913446

Country of ref document: EP

Kind code of ref document: A1