CN112232322A - Image generation method and device based on object state prediction - Google Patents


Info

Publication number
CN112232322A
Authority
CN
China
Prior art keywords
image
feature
image set
feature extraction
images
Prior art date
Legal status
Granted
Application number
CN202011465431.7A
Other languages
Chinese (zh)
Other versions
CN112232322B (en)
Inventor
朱宝成
詹姆士·张
王世军
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011465431.7A
Priority claimed from CN202011465431.7A
Publication of CN112232322A
Application granted
Publication of CN112232322B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/42 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items, of sport video content
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06V 20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of this specification provide an image generation method and apparatus based on object state prediction. In the method, an image set formed by a plurality of image frames arranged in time sequence is acquired, the image set containing an object whose position moves across the plurality of image frames and an environment outside the object; the image set is input into a feature extraction model to obtain a first feature of the image set, the first feature including static parameters of the object and static parameters of the environment; a second feature of the object is determined from the image set, the second feature including a motion state of the object at a specified time in the image set; the first feature and the second feature are input into a state prediction model to obtain a predicted motion state of the object at at least one target time after the specified time; and a decoder is used to generate a predicted image at the target time based on the predicted motion state.

Description

Image generation method and device based on object state prediction
Technical Field
One or more embodiments of the present disclosure relate to the field of image processing technologies, and in particular, to an image generation method and apparatus based on object state prediction.
Background
With the continuous development of science and technology, image acquisition technology has also advanced, and acquiring images with devices such as mobile phones and cameras has become increasingly widespread. Many directions of development build on various kinds of processing of these images. In general, applications such as vehicle tracking and person locking can be performed based on captured images. In fields such as image vision, predicting the images that may appear over a future period of time, based on partial images that have already been acquired, is a new research direction.
Accordingly, improved solutions are desired that can more accurately generate new images based on already acquired partial images.
Disclosure of Invention
One or more embodiments of the present specification describe an image generation method and apparatus based on object state prediction, which can generate a new image more accurately based on a partial image that has been acquired. The specific technical scheme is as follows.
In a first aspect, an embodiment provides an image generation method based on object state prediction, which is performed by a computing platform, and includes:
acquiring an image set formed by a plurality of image frames according to a time sequence, wherein the image set comprises an object with a moving position in the image frames and an environment outside the object;
inputting the image set into a feature extraction model to obtain first features of the image set, wherein the first features comprise static parameters of the object and static parameters of the environment;
determining a second feature of the object from the set of images, including a motion state of the object at a specified time in the set of images;
inputting the first characteristic and the second characteristic into a state prediction model to obtain a predicted motion state of the object at least one target moment after the specified moment;
and generating a prediction image at the target time based on the predicted motion state by using a decoder.
In one embodiment, the step of inputting the image set into a feature extraction model to obtain a first feature of the image set includes:
and determining sub-features corresponding to each two continuous frames of images in the image set through the feature extraction model, and performing feature aggregation on the sub-features to obtain a first feature of the image set.
In one embodiment, the image set further comprises a plurality of location information of the object in a plurality of image frames; the step of inputting the image set into a feature extraction model to obtain a first feature of the image set includes:
and inputting the image frames in the image set and the corresponding position information into a feature extraction model to obtain a first feature of the image set.
In one embodiment, the step of determining a second feature of the object from the set of images comprises:
determining a second feature of the object from the set of images based on a variational auto-encoder;
alternatively, a second feature of the object is determined from the set of images based on a recurrent neural network RNN.
In one embodiment, the step of determining a second feature of the object from the set of images comprises:
a first image frame at a specified time and a previous image frame to the first image frame are selected from the set of images, and a second feature of the object is determined from the first image frame and the previous image frame.
In one embodiment, in training the feature extraction model and the state prediction model, the training is based on a multi-step predictive loss function.
In one embodiment, the static parameters of the object include at least one of mass and volume of the object, and the static parameters of the environment include at least one of ground friction and air resistance in the environment.
In one embodiment, the feature extraction model is implemented by using a convolutional neural network CNN, a deep neural network DNN or a multilayer perceptron MLP; the state prediction model is obtained based on a random state space model.
In a second aspect, an embodiment provides an image generation apparatus based on object state prediction, deployed in a computing platform, the apparatus including:
an image acquisition module configured to acquire an image set formed by a plurality of image frames in a time series, the image set including an object whose position is moved in the plurality of image frames and an environment other than the object;
a feature extraction module configured to input the image set into a feature extraction model to obtain a first feature of the image set, wherein the first feature comprises a static parameter of the object and a static parameter of the environment;
a feature determination module configured to determine a second feature of the object from the image set, wherein the second feature comprises a motion state of the object at a specified time in the image set;
a motion prediction module configured to input the first feature and the second feature into a state prediction model to obtain a predicted motion state of the object at least one target time after the specified time;
an image generation module configured to generate, with a decoder, a predicted image at the target time based on the predicted motion state.
In one embodiment, the feature extraction module is specifically configured to:
and determining sub-features corresponding to each two continuous frames of images in the image set through the feature extraction model, and performing feature aggregation on the sub-features to obtain a first feature of the image set.
In one embodiment, the image set further comprises a plurality of location information of the object in a plurality of image frames; the feature extraction module is specifically configured to:
and inputting the image frames in the image set and the corresponding position information into a feature extraction model to obtain a first feature of the image set.
In one embodiment, the feature determination module is specifically configured to: determining a second feature of the object from the set of images based on a variational auto-encoder; or,
the feature determination module is specifically configured to: determining a second feature of the object from the set of images based on a Recurrent Neural Network (RNN).
In one embodiment, the feature determination module is specifically configured to:
a first image frame at a specified time and a previous image frame to the first image frame are selected from the set of images, and a second feature of the object is determined from the first image frame and the previous image frame.
In one embodiment, in training the feature extraction model and the state prediction model, the training is based on a multi-step predictive loss function.
In one embodiment, the static parameters of the object include at least one of mass and volume of the object, and the static parameters of the environment include at least one of ground friction and air resistance in the environment.
In one embodiment, the feature extraction model is implemented by using a convolutional neural network CNN, a deep neural network DNN or a multilayer perceptron MLP; the state prediction model is obtained based on a random state space model.
In a third aspect, embodiments provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any of the first aspect.
In a fourth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first aspect.
In the method and apparatus provided by the embodiments of this specification, the feature extraction model can extract a feature of the static parameters in the image set, a feature of the motion state of the object at the specified time can be extracted from the image set, and the two features are input into the state prediction model to obtain the predicted motion state of the object at a target time after the specified time, thereby realizing prediction of the motion state of the object; with the decoder, a predicted image at the target time can then be generated based on the predicted motion state. Through the combination of these implementation flows, the embodiments of this specification can generate a new image more accurately based on the images already acquired in the image set.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
fig. 2 is a schematic flowchart of an image generation method based on object state prediction according to an embodiment;
FIG. 3 is a schematic diagram of the generation of a predictive image;
fig. 4 is a schematic block diagram of an image generation apparatus based on object state prediction according to an embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The image set includes image frames at a plurality of times, such as t1, t2, t3, t4, …, tn, where n is an integer greater than 1, and the image frames contain a moving object; here time tn is taken as the specified time in the image set as an example. A first feature and a second feature are determined from the image set: the first feature is determined through a feature extraction model and includes static parameters of the object and static parameters of the environment, while the second feature includes the motion state of the object at the specified time tn. The first feature and the second feature are input into a state prediction model to obtain the predicted motion state at each target time after time tn, for example at times tn+1, tn+2, tn+3, and so on. Decoding is performed by a decoder on each predicted motion state, and a predicted image at the corresponding target time can be obtained. The number of predicted images may be one or more, which is not specifically limited in the embodiments of this specification.
Based on the above implementation flow, the embodiments of the present specification may generate a prediction image at a subsequent time by predicting the motion state of the object in the image set and adopting a decoding method based on the predicted motion state. The following is a detailed description of specific examples.
Fig. 2 is a flowchart illustrating an image generation method based on object state prediction according to an embodiment. The method is performed by a computing platform, and may specifically be performed by any device, apparatus, platform, apparatus cluster, etc. having computing and processing capabilities. The method includes the following steps S210-S250.
Step S210 is to acquire an image set formed by a plurality of image frames in time series. The image set includes an object whose position is moved in a plurality of image frames and an environment outside the object.
The plurality of image frames can be sampled from a video, or can be images continuously acquired by an image acquisition device at a plurality of moments. The image acquisition device may be, for example, an ordinary camera, a surveillance camera, or a video camera.
In a plurality of image frames in an image set, each image frame may contain a corresponding time stamp, which may be an image acquisition time instant of the image frame. The plurality of image frames in the image set may be arranged in the chronological order of the time stamps. The plurality of image frames may include an object whose position is moved, and the number of the object may be one or more. The object is not limited to being present in each image frame of the image set as long as objects at different positions are included in the plurality of image frames of the image set.
The object in the image set may include various objects as long as it meets the condition that the position is moved, for example, a person, a vehicle, a bicycle, an animal, a balloon which is fluttered, a flying bird, and the like. In addition to the objects in the image frame, an environment, i.e. an environmental region, may also be included. The environmental region in the image may include any background region, which is not limited in this application.
Step S220, inputting the image set into the feature extraction model to obtain a first feature of the image set, where the first feature includes a static parameter of the object and a static parameter of the environment. The use of "first" in the first feature herein, and the corresponding "first" in the following description, is for convenience of distinction and description only, and is not intended to be limiting in any way.
The feature extraction model may be trained in advance using the samples. When the training process reaches the convergence condition, it is applied to the embodiment for extracting the first feature of the image set.
The feature extraction model may be implemented by training a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), or a Multi-Layer Perceptron (MLP), thereby realizing the function of the feature extraction model in this embodiment.
The static parameters of the object include at least one of a mass and a volume of the object, and may further include other parameters that are not time-varying and that can characterize the object. The static parameters of the environment include at least one of ground friction and air resistance in the environment, the ground friction including friction between the object and the ground. The static parameters of the environment may also include other parameters that can characterize the interaction force between the object and the environment.
When an image set is input to a feature extraction model, a plurality of image frames and corresponding time stamps in the image set may be input to the feature extraction model. The feature extraction model may extract static parameters of the object and the environment through internal multiple computational layers based on changes in position of the object in multiple image frames in the image set and temporal changes between the image frames. When extracting the static parameters of the object and the environment in the image set, the feature extraction model may use a corresponding specific parameter matrix to represent the static parameters.
In this step, the image set is input into the feature extraction model, and when the first feature of the image set is determined by the feature extraction model, sub-features corresponding to every two consecutive image frames in the image set can be determined by the feature extraction model, and feature aggregation is performed on these sub-features to obtain the first feature of the image set.
In one example, for 6 image frames a1–a6 contained in the image set, the feature extraction model may first determine sub-features for a1 and a2, for a3 and a4, and for a5 and a6, respectively, and then perform feature aggregation on the three sub-features to obtain the first feature of the image set.
Two consecutive frames refer to two adjacent frames among the plurality of image frames arranged in time sequence. When the sub-feature of two image frames is determined, the two image frames can be stitched into one stitched image, which is then input into a computation layer of the feature extraction model.
When the individual sub-features are aggregated, the aggregation may be performed in a conventional manner, for example by averaging the plurality of sub-features or computing a weighted average.
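As a minimal illustrative sketch of the pairwise sub-feature extraction and mean aggregation described above — assuming a PyTorch-style implementation with arbitrary layer sizes; the names PairEncoder and first_feature and the 6-channel stitching of two RGB frames are illustrative assumptions, not part of the disclosure:

```python
import torch
import torch.nn as nn

class PairEncoder(nn.Module):
    """Hypothetical CNN mapping two stitched consecutive RGB frames to one sub-feature."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, stride=2, padding=1),  # two RGB frames stacked on the channel axis
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, frame_a, frame_b):
        x = torch.cat([frame_a, frame_b], dim=1)   # "stitch" the two frames into one input
        return self.fc(self.conv(x).flatten(1))

def first_feature(frames, encoder):
    """Sub-feature for each consecutive pair (a1,a2), (a3,a4), ..., averaged into the first feature."""
    subs = [encoder(frames[i], frames[i + 1]) for i in range(0, len(frames) - 1, 2)]
    return torch.stack(subs, dim=0).mean(dim=0)    # simple mean aggregation; a weighted average also works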
In step S230, a second feature of the object is determined from the image set, wherein the second feature comprises a motion state of the object at a specified time tn in the image set. The specified time may be a time selected from the image set, and may be any time other than the first time of the image set. In one embodiment, the specified time may be the latest time in the image set.
The specified time in the image set may be determined from timestamps associated with a plurality of image frames included in the image set. Here, the specified time is represented by tn. The specified time tn may be the latest time stamp in the plurality of image frames. The process of the positional movement of the object is recorded in a plurality of image frames, the motion state of the object can be determined from the change in the position of the object relative to the environment in the plurality of image frames, and the motion state at the specified time tn is determined from the plurality of motion states.
The motion state of the object includes at least one of a position, a velocity, an acceleration, an angular velocity, and the like of the object.
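For intuition only — and not as the claimed extraction method — a naive finite-difference estimate of such a motion state from the object's positions in three consecutive frames could be written as follows (the position tuples and the frame interval dt are assumed inputs):

```python
def estimate_motion_state(pos_prev2, pos_prev, pos_curr, dt):
    """Finite-difference velocity and acceleration from object (x, y) positions at tn-2, tn-1, tn."""
    vx, vy = (pos_curr[0] - pos_prev[0]) / dt, (pos_curr[1] - pos_prev[1]) / dt
    vx_prev, vy_prev = (pos_prev[0] - pos_prev2[0]) / dt, (pos_prev[1] - pos_prev2[1]) / dt
    return {
        "position": pos_curr,
        "velocity": (vx, vy),
        "acceleration": ((vx - vx_prev) / dt, (vy - vy_prev) / dt),
    }
```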
When the second feature of the object is determined from the image set, all image frames in the image set may be used for the determination, or a first image frame and a previous image frame of the first image frame at a specified time tn may be selected from the image set, and the second feature of the object may be determined from the first image frame and the previous image frame.
In determining the motion state of the object at the specified time from the image set, a preset information extraction algorithm may be employed. For example, a second feature of the object may be determined from the set of images based on a variational auto-encoder; alternatively, the second feature of the object may be determined from the set of images based on a Recurrent Neural Network (RNN). The variational auto-encoder and the recurrent neural network can be obtained by training based on the sample image in advance.
The variational auto-encoder adopts a neural network architecture: through a multi-layer neural network, it can compress an image into an intermediate representation and then restore the original image from that intermediate representation, so it can be used to extract a low-dimensional representation of the image. In this embodiment, a variational auto-encoder may be employed to extract an intermediate representation of image frames in the image set, and the motion state of the object is determined based on this intermediate representation; in particular, the variational auto-encoder may be employed to extract an intermediate representation of the first image frame and its previous image frame.
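A minimal sketch of such a variational encoder, assuming RGB frames and a 16-dimensional latent representation; the class name and layer sizes are illustrative and not the architecture of this embodiment:

```python
import torch
import torch.nn as nn

class FrameVAEEncoder(nn.Module):
    """Hypothetical variational encoder: compresses the frame pair at (tn-1, tn) into the mean and
    log-variance of a low-dimensional latent Gaussian, i.e. an intermediate representation."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)

    def forward(self, frame_tn, frame_prev):
        h = self.backbone(torch.cat([frame_tn, frame_prev], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return z, mu, logvar                                    # z serves as the second feature here
```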
Step S240, inputting the first characteristic and the second characteristic into a state prediction model to obtain the predicted motion state of the object at least one target moment after the specified moment. The state prediction model may be configured to determine a motion state of the object at the target time based on the input first feature and the input second feature, and to use the motion state as the predicted motion state.
The state prediction model may be trained based on a stochastic state space model. The stochastic state space model describes the dynamic evolution of the object state over time as well as the mapping from real observations to states, and can therefore predict the motion state of the object at the target time based on the first feature and the second feature.
The second feature comprises the motion state of the object at the specified time tn in the image set, which can be taken as the initial state. A target time after the specified time tn may be understood as a time that is a preset time period after tn. There may be one or more such preset time periods, for example at least one of 1 s, 2 s, 3 s, and so on. For example, the target times may include tn+1, tn+2, tn+3, etc., and the motion states of the object at these target times may be determined based on the initial state.
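As an illustrative sketch of this multi-step prediction — using a deterministic MLP transition as a simplified stand-in for the stochastic state space model described above; the names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class StatePredictor(nn.Module):
    """Hypothetical transition network: given the first feature (static parameters of object and
    environment) and the current motion state, predicts the motion state one target time ahead."""
    def __init__(self, feat_dim=64, state_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, first_feature, state):
        return self.net(torch.cat([first_feature, state], dim=-1))

def rollout(predictor, first_feature, initial_state, num_steps):
    """Predicted motion states at tn+1, tn+2, ..., starting from the initial state at tn."""
    states, state = [], initial_state
    for _ in range(num_steps):
        state = predictor(first_feature, state)
        states.append(state)
    return states
```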
In step S250, a decoder is used to generate a predicted image at the target time based on the predicted motion state. Specifically, the decoder may generate the predicted image at the target time based on the predicted motion state and an image frame in the image set; this image frame may be, but is not limited to, the first image frame at the specified time. The process of generating the predicted image in this step is a decoding process, and the decoder may be the decoding part of a variational auto-encoder, i.e. the part that restores an image from its low-dimensional intermediate representation, so that the predicted image can be generated based on the predicted motion state.
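A sketch of such a decoding step, assuming the low-dimensional predicted state is combined with a context feature derived from the first image frame and decoded into a toy-sized image; the sizes and names are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class FrameDecoder(nn.Module):
    """Hypothetical decoder in the spirit of the VAE decoding path: restores an image from the
    predicted motion state plus a context feature taken from the frame at the specified time."""
    def __init__(self, state_dim=16, context_dim=64, img_channels=3):
        super().__init__()
        self.fc = nn.Linear(state_dim + context_dim, 64 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, predicted_state, context_feature):
        h = self.fc(torch.cat([predicted_state, context_feature], dim=-1))
        return self.deconv(h.view(-1, 64, 8, 8))    # a 32x32 predicted image in this toy setup
```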
For example, fig. 3 is a schematic diagram of generating a predicted image. The existing image set includes the image frame at time tn-1 and the image frame at time tn, and the above embodiment can generate a predicted image at time tn+1, and so on. In these images, the grass of a football pitch is the background area and the football is the moving object. The generated predicted image includes the background and the object in a plausible state, including a plausible posture, shape, position, and the like.
In another embodiment of the present specification, the image set may further include a plurality of position information of the object in a plurality of image frames. For example, for each image frame in which the object exists, the position information of the object at the time corresponding to the image frame is acquired in advance. The position information may be a position of the object in a ground coordinate system, or may be a position of the object in a camera coordinate system or an image coordinate system. The position information may be determined in advance based on data such as the position of the object in the image, camera internal parameters, and camera external parameters. In this way, the image set has a correspondence relationship between the image frames, the time stamps, and the position information.
In step S220, when the image set is input into the feature extraction model to obtain the first feature of the image set, the image frames in the image set and the corresponding position information may specifically be input into the feature extraction model, and the first feature of the image set is determined by the feature extraction model. Extracting the static parameters of the object and the environment from the image frames with the aid of the object's position information can improve the accuracy of the extracted static parameters.
In this embodiment, when the image frames in the image set and the corresponding position information are input into the feature extraction model, the feature extraction model may specifically determine corresponding sub-features based on each two consecutive frames of images in the image set and the corresponding position information, and perform feature aggregation on the sub-features to obtain the first feature of the image set.
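A hedged sketch of how the position information might be fused with the visual sub-feature of each frame pair; the fusion layer, the use of normalized (x, y) coordinates, and the single-example batch are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PairEncoderWithPosition(nn.Module):
    """Hypothetical variant of the pair encoder that also consumes the object's positions in the two frames."""
    def __init__(self, visual_encoder, feat_dim=64):
        super().__init__()
        self.visual_encoder = visual_encoder           # e.g. the PairEncoder sketched earlier
        self.fuse = nn.Linear(feat_dim + 4, feat_dim)  # 4 = normalized (x, y) for each of the two frames

    def forward(self, frame_a, frame_b, pos_a, pos_b):
        visual = self.visual_encoder(frame_a, frame_b)            # shape (1, feat_dim), batch of one
        pos = torch.tensor([[*pos_a, *pos_b]], dtype=torch.float32)
        return self.fuse(torch.cat([visual, pos], dim=-1))
```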
Before the embodiments shown in steps S210 to S250 are performed, the feature extraction model and the state prediction model may be trained. Both models can be implemented using trained neural networks. In one embodiment, the two models may be trained jointly. In training the feature extraction model and the state prediction model, a large number of sample image sets, which contain a plurality of sample image frames, and an object in which the position is moved and an environment other than the object, may be used for training. And inputting the sample image set into a feature extraction model to obtain a first sample feature of the sample image set, wherein the first sample feature comprises the static parameters of the object and the static parameters of the environment. A second sample feature of the object is determined from the sample image set, the second sample feature comprising a state of motion of the object at a specified time in the sample image set. And inputting the first sample characteristic and the second sample characteristic into a state prediction model to obtain a predicted motion state of at least one target time after the specified time, and generating a predicted image of the target time by adopting a decoder based on the predicted motion state. A prediction loss is determined based on a difference between the predicted image and a sample image frame at a target time in the sample image set, and the feature extraction model and the state prediction model are updated in a direction to reduce the prediction loss.
Alternatively to joint training, the state prediction model may be trained first, and the feature extraction model then trained, in a manner similar to the joint training described above.
In one embodiment, when training the feature extraction model and the state prediction model, the training may be based on a multi-step prediction loss function. For example, when predicting motion states, multi-step prediction may be performed from the initial state of the object: the motion states of the object at a plurality of target times after the specified time are predicted, predicted images at these target times are obtained respectively, and the prediction loss is then determined based on the differences between the predicted images at these times and the corresponding image frames in the sample image set. In this embodiment, using a multi-step prediction loss function can improve the stability of long-term prediction.
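Tying the earlier sketches together, a joint training step with such a multi-step prediction loss could look as follows. This is an illustration only: the mean-squared reconstruction error, the omission of the VAE's KL term, and the reuse of first_feature, rollout and the encoder/decoder classes from the previous sketches are all assumptions rather than the disclosed training procedure.

```python
import torch
import torch.nn.functional as F

def multi_step_prediction_loss(pred_images, target_frames):
    """Average reconstruction error over all predicted target times, so that the errors at
    tn+1, tn+2, ... all contribute to the gradient (multi-step prediction loss)."""
    losses = [F.mse_loss(p, t) for p, t in zip(pred_images, target_frames)]
    return torch.stack(losses).mean()

def train_step(feature_model, vae_encoder, predictor, decoder, optimizer,
               sample_frames, target_frames, num_steps):
    """One joint update of the feature extraction model and the state prediction model."""
    first_feat = first_feature(sample_frames, feature_model)       # static parameters (first feature)
    z, _, _ = vae_encoder(sample_frames[-1], sample_frames[-2])    # motion state at tn (second feature)
    states = rollout(predictor, first_feat, z, num_steps)          # predicted states at tn+1, tn+2, ...
    context = feature_model(sample_frames[-1], sample_frames[-2])  # context feature for the decoder
    preds = [decoder(s, context) for s in states]
    loss = multi_step_prediction_loss(preds, target_frames)        # target_frames: ground-truth frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```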
In the above embodiments provided in this specification, by predicting the motion state of the object in the image set and then decoding based on the predicted motion state, a predicted image at a subsequent time can be generated. The embodiments can be applied to technologies such as autonomous driving and robotics. In the field of autonomous driving, vehicle state prediction or obstacle prediction may be performed based on the generated predicted image; in the field of robotics, obstacles can likewise be predicted based on the generated predicted image.
In another application scenario, some video data may be damaged, for example because the data is old, so that the video cannot be displayed. With the above embodiments of this specification, the damaged data can be repaired so that the video can be displayed normally.
Based on the above embodiments, corrupted video data often occurs somewhere in the middle of a segment of video. The specified time may therefore be a time in the middle of the image set: for example, if ti through tj is the period of corrupted data, the specified time may be a time before that period, e.g. ti-1 or ti-2, and the target time after the specified time may be a time within the corrupted period, e.g. one or more times between ti and tj.
When determining the second feature of the object from the image set in this scenario, the second feature includes, in addition to the motion state of the object at the specified time, the motion state of the object at a first time in the image set, the first time being a time after the period during which the data is corrupted. For example, if ti through tj is the period of corrupted data, the first time may be tj+1 or tj+2.
When the first feature and the second feature are input into the state prediction model, the state prediction model may determine the predicted motion state of the object at the target time based on the static parameters of the object and of the environment in the first feature, the motion state of the object at the specified time in the second feature, and the motion state of the object at the first time. A predicted motion state determined in this way takes into account the motion states of the object both before and after the damaged data, so it fits well with the motion states of the object in the intact data on both sides of the damaged segment, and the generated predicted image is more accurate.
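As a hypothetical sketch only — not the disclosed implementation — the state prediction in this video-repair scenario could condition on the motion states on both sides of the corrupted span, for example:

```python
import torch
import torch.nn as nn

class GapStatePredictor(nn.Module):
    """Hypothetical predictor for the repair scenario: conditions on the motion state just before
    the corrupted span (at the specified time) and just after it (at the first time), plus the
    relative position alpha of the target time inside the corrupted span."""
    def __init__(self, feat_dim=64, state_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 2 * state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, first_feature, state_before, state_after, alpha):
        # alpha: tensor of shape (1, 1) with a value in [0, 1]
        x = torch.cat([first_feature, state_before, state_after, alpha], dim=-1)
        return self.net(x)
```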
The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown or in sequential order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 4 is a schematic block diagram of an image generation apparatus based on object state prediction according to an embodiment. The apparatus 400 is deployed in a computing platform that may be implemented by any apparatus, device, platform, cluster of devices, etc. having computing and processing capabilities. This apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus 400 includes:
an image acquisition module 410 configured to acquire an image set formed by a plurality of image frames in a time series, the image set including an object whose position is moved in the plurality of image frames and an environment other than the object;
a feature extraction module 420 configured to input the image set into a feature extraction model to obtain a first feature of the image set, wherein the first feature includes a static parameter of the object and a static parameter of the environment;
a feature determination module 430 configured to determine a second feature of the object from the image set, wherein the second feature comprises a motion state of the object at a specified time in the image set;
a motion prediction module 440 configured to input the first feature and the second feature into a state prediction model to obtain a predicted motion state of the object at least one target time after the specified time;
an image generation module 450 configured to generate, with a decoder, a predicted image at the target time based on the predicted motion state.
In one embodiment, the feature extraction module 420 is specifically configured to:
and determining sub-features corresponding to each two continuous frames of images in the image set through the feature extraction model, and performing feature aggregation on the sub-features to obtain a first feature of the image set.
In one embodiment, the image set further comprises a plurality of position information of said object in a plurality of image frames; the feature extraction module 420 is specifically configured to:
and inputting the image frames in the image set and the corresponding position information into a feature extraction model to obtain a first feature of the image set.
In one embodiment, the feature determining module 430 is specifically configured to: determining a second feature of the object from the set of images based on a variational auto-encoder; or,
the feature determination module 430 is specifically configured to: determining a second feature of the object from the set of images based on a Recurrent Neural Network (RNN).
In one embodiment, the feature determination module 430 is specifically configured to:
a first image frame at a specified time and a previous image frame to the first image frame are selected from the set of images, and a second feature of the object is determined from the first image frame and the previous image frame.
In one embodiment, in training the feature extraction model and the state prediction model, the training is based on a multi-step predictive loss function.
In one embodiment, the static parameters of the object include at least one of mass and volume of the object, and the static parameters of the environment include at least one of ground friction and air resistance in the environment.
In one embodiment, the feature extraction model is implemented by using a convolutional neural network CNN, a deep neural network DNN or a multilayer perceptron MLP; the state prediction model is obtained based on a random state space model.
The above device embodiments correspond to the method embodiments, and specific descriptions may refer to descriptions of the method embodiments, which are not repeated herein. The device embodiment is obtained based on the corresponding method embodiment, has the same technical effect as the corresponding method embodiment, and for the specific description, reference may be made to the corresponding method embodiment.
Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of fig. 1 to 3.
The embodiment of the present specification further provides a computing device, which includes a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in any one of fig. 1 to 3.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the storage medium and the computing device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments further describe the objects, technical solutions and advantages of the embodiments of the present invention in detail. It should be understood that the above description is only exemplary of the embodiments of the present invention, and is not intended to limit the scope of the present invention, and any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (18)

1. An image generation method based on object state prediction, performed by a computing platform, the method comprising:
acquiring an image set formed by a plurality of image frames according to a time sequence, wherein the image set comprises an object with a moving position in the image frames and an environment outside the object;
inputting the image set into a feature extraction model to obtain first features of the image set, wherein the first features comprise static parameters of the object and static parameters of the environment;
determining a second feature of the object from the set of images, including a motion state of the object at a specified time in the set of images;
inputting the first characteristic and the second characteristic into a state prediction model to obtain a predicted motion state of the object at least one target moment after the specified moment;
and generating a prediction image at the target time based on the predicted motion state by using a decoder.
2. The method of claim 1, wherein said step of inputting said image set into a feature extraction model to obtain a first feature of said image set comprises:
and determining sub-features corresponding to each two continuous frames of images in the image set through the feature extraction model, and performing feature aggregation on the sub-features to obtain a first feature of the image set.
3. The method of claim 1, the image set further comprising a plurality of location information of the object in a plurality of image frames; the step of inputting the image set into a feature extraction model to obtain a first feature of the image set includes:
and inputting the image frames in the image set and the corresponding position information into a feature extraction model to obtain a first feature of the image set.
4. The method of claim 1, the step of determining a second feature of the object from the set of images, comprising:
determining a second feature of the object from the set of images based on a variational auto-encoder;
alternatively, a second feature of the object is determined from the set of images based on a recurrent neural network RNN.
5. The method of claim 1 or 4, the step of determining a second feature of the object from the set of images, comprising:
a first image frame at a specified time and a previous image frame to the first image frame are selected from the set of images, and a second feature of the object is determined from the first image frame and the previous image frame.
6. The method of claim 1, wherein in training the feature extraction model and the state prediction model, training is based on a multi-step predictive loss function.
7. The method of claim 1, the static parameters of the object comprising at least one of a mass and a volume of the object, the static parameters of the environment comprising at least one of ground friction and air resistance in the environment.
8. The method of claim 1, the feature extraction model is implemented with a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), or a multi-layered perceptron (MLP); the state prediction model is obtained based on a random state space model.
9. An image generation apparatus based on object state prediction, deployed in a computing platform, the apparatus comprising:
an image acquisition module configured to acquire an image set formed by a plurality of image frames in a time series, the image set including an object whose position is moved in the plurality of image frames and an environment other than the object;
a feature extraction module configured to input the image set into a feature extraction model to obtain a first feature of the image set, wherein the first feature comprises a static parameter of the object and a static parameter of the environment;
a feature determination module configured to determine a second feature of the object from the image set, wherein the second feature comprises a motion state of the object at a specified time in the image set;
a motion prediction module configured to input the first feature and the second feature into a state prediction model to obtain a predicted motion state of the object at least one target time after the specified time;
an image generation module configured to generate, with a decoder, a predicted image at the target time based on the predicted motion state.
10. The apparatus of claim 9, the feature extraction module being specifically configured to:
and determining sub-features corresponding to each two continuous frames of images in the image set through the feature extraction model, and performing feature aggregation on the sub-features to obtain a first feature of the image set.
11. The apparatus of claim 9, the image set further comprising a plurality of location information of the object in a plurality of image frames; the feature extraction module is specifically configured to:
and inputting the image frames in the image set and the corresponding position information into a feature extraction model to obtain a first feature of the image set.
12. The apparatus of claim 9, the feature determination module being specifically configured to:
determining a second feature of the object from the set of images based on a variational auto-encoder; or,
the feature determination module is specifically configured to:
determining a second feature of the object from the set of images based on a Recurrent Neural Network (RNN).
13. The apparatus according to claim 9 or 12, wherein the feature determination module is specifically configured to:
a first image frame at a specified time and a previous image frame to the first image frame are selected from the set of images, and a second feature of the object is determined from the first image frame and the previous image frame.
14. The apparatus of claim 9, wherein in training the feature extraction model and the state prediction model, training is based on a multi-step predictive loss function.
15. The apparatus of claim 9, the static parameters of the object comprising at least one of a mass and a volume of the object, the static parameters of the environment comprising at least one of ground friction and air resistance in the environment.
16. The apparatus of claim 9, the feature extraction model is implemented with a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), or a multi-layered perceptron (MLP); the state prediction model is obtained based on a random state space model.
17. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-8.
18. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-8.
CN202011465431.7A 2020-12-14 Image generation method and device based on object state prediction Active CN112232322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011465431.7A CN112232322B (en) 2020-12-14 Image generation method and device based on object state prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011465431.7A CN112232322B (en) 2020-12-14 Image generation method and device based on object state prediction

Publications (2)

Publication Number Publication Date
CN112232322A true CN112232322A (en) 2021-01-15
CN112232322B CN112232322B (en) 2024-08-02

Family

ID=

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022257474A1 (en) * 2021-06-11 2022-12-15 荣耀终端有限公司 Image prediction method, electronic device and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961312A (en) * 2018-04-03 2018-12-07 奥瞳***科技有限公司 High-performance visual object tracking and system for embedded vision system
WO2019037498A1 (en) * 2017-08-25 2019-02-28 腾讯科技(深圳)有限公司 Active tracking method, device and system
US20190079535A1 (en) * 2017-09-13 2019-03-14 TuSimple Training and testing of a neural network method for deep odometry assisted by static scene optical flow
CN109636770A (en) * 2017-10-06 2019-04-16 福特全球技术公司 For the movement of object detection and trajectory predictions and the fusion of external appearance characteristic
CN109889849A (en) * 2019-01-30 2019-06-14 北京市商汤科技开发有限公司 Video generation method, device, medium and equipment
CN110298238A (en) * 2019-05-20 2019-10-01 平安科技(深圳)有限公司 Pedestrian's visual tracking method, model training method, device, equipment and storage medium
CN110825123A (en) * 2019-10-21 2020-02-21 哈尔滨理工大学 Control system and method for automatic following loading vehicle based on motion algorithm
US20200092124A1 (en) * 2019-07-23 2020-03-19 Lg Electronics Inc. METHOD FOR PROVIDING IoT DEVICE INFORMATION, APPARATUS AND INTELLIGENT COMPUTING DEVICE THEREOF
CN110930483A (en) * 2019-11-20 2020-03-27 腾讯科技(深圳)有限公司 Role control method, model training method and related device
CN111179328A (en) * 2019-12-31 2020-05-19 智车优行科技(上海)有限公司 Data synchronization calibration method and device, readable storage medium and electronic equipment
CN111414852A (en) * 2020-03-19 2020-07-14 驭势科技(南京)有限公司 Image prediction and vehicle behavior planning method, device and system and storage medium
CN111540000A (en) * 2020-04-28 2020-08-14 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium


Similar Documents

Publication Publication Date Title
JP6877630B2 (en) How and system to detect actions
CN106462976B (en) Method for tracking shape in scene observed by asynchronous sensor
JP6861249B2 (en) How to Train a Convolutional Recurrent Neural Network, and How to Semantic Segmentation of Input Video Using a Trained Convolutional Recurrent Neural Network
JP5334944B2 (en) Multi-object tracking using knowledge-based autonomous adaptation at the tracking modeling level
US11100646B2 (en) Future semantic segmentation prediction using 3D structure
CN114723955A (en) Image processing method, device, equipment and computer readable storage medium
US9147114B2 (en) Vision based target tracking for constrained environments
JP2006260527A (en) Image matching method and image interpolation method using same
Kim et al. Simvodis: Simultaneous visual odometry, object detection, and instance segmentation
CN112434679B (en) Rehabilitation exercise evaluation method and device, equipment and storage medium
US20110208685A1 (en) Motion Capture Using Intelligent Part Identification
JP2020126617A (en) Learning method and learning device for removing jittering on video acquired through shaking camera by using a plurality of neural networks for fault tolerance and fluctuation robustness, and testing method and testing device using the same
CN112861808B (en) Dynamic gesture recognition method, device, computer equipment and readable storage medium
CN110738687A (en) Object tracking method, device, equipment and storage medium
CN116188684A (en) Three-dimensional human body reconstruction method based on video sequence and related equipment
WO2020213099A1 (en) Object detection/tracking device, method, and program recording medium
KR102648270B1 (en) Method and apparatus for coordinate and uncertainty estimation in images
Yu et al. Human motion prediction with gated recurrent unit model of multi-dimensional input
CN112232322B (en) Image generation method and device based on object state prediction
CN112232322A (en) Image generation method and device based on object state prediction
CN111898573A (en) Image prediction method, computer device, and storage medium
CN114067371B (en) Cross-modal pedestrian trajectory generation type prediction framework, method and device
US20220383652A1 (en) Monitoring Animal Pose Dynamics from Monocular Images
CN114612526A (en) Joint point tracking method, and Parkinson auxiliary diagnosis method and device
US20230017138A1 (en) System and Method for Generating Training Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant