CN113705402A - Video behavior prediction method, system, electronic device and storage medium - Google Patents

Video behavior prediction method, system, electronic device and storage medium Download PDF

Info

Publication number
CN113705402A
CN113705402A (application CN202110950812.2A)
Authority
CN
China
Prior art keywords
video
behavior prediction
neural network
convolution neural
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110950812.2A
Other languages
Chinese (zh)
Inventor
徐常胜
杨小汕
黄毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110950812.2A priority Critical patent/CN113705402A/en
Publication of CN113705402A publication Critical patent/CN113705402A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention provides a video behavior prediction method, system, electronic device and storage medium. The method comprises: acquiring a target video to be predicted; and inputting the target video into a video behavior prediction model to obtain the behavior prediction result output by the model. The video behavior prediction model performs dynamic relational modeling on the historical moment features of the target video and its predicted state features at the future moment through a graph convolution neural network, optimizes the graph convolution neural network through knowledge distillation, and fuses the multi-modal features after dynamic relational modeling based on the optimized network to obtain the video behavior prediction result. The method, system, electronic device and storage medium can effectively capture the multi-modal dynamic relation changes between historical segments and future segments in videos, and can predict the future behaviors of a video more accurately through the graph convolution neural network optimized by knowledge distillation.

Description

Video behavior prediction method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of video analysis technologies, and in particular, to a method, a system, an electronic device, and a storage medium for video behavior prediction.
Background
With the rapid development of the computer and internet of things technologies, the future behavior prediction technology based on video has increasingly wide practical application scenes in the fields of automatic driving, man-machine interaction, wearable device assistants and the like.
In the conventional video behavior prediction method, after context modeling is performed on an observed video, a hidden state representation of the observed video is directly used for generating future behavior characteristics, so that behavior prediction is realized. However, this way of directly predicting future behavior based on past video segments ignores the strong correlation that potentially exists between past and future behavior.
In addition, in the training stage of the model, the traditional video behavior prediction method does not consider using a training sample containing a future video segment, so that the learning of the associated knowledge between the past behavior and the future behavior is insufficient, and the obtained behavior prediction result is not accurate and reliable.
Disclosure of Invention
The invention provides a video behavior prediction method, a video behavior prediction system, electronic equipment and a storage medium, which are used for solving the technical problem that video behavior prediction is not accurate and reliable enough in the prior art.
In a first aspect, the present invention provides a video behavior prediction method, including:
acquiring a target video to be predicted;
inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
According to the video behavior prediction method provided by the invention, the training process of the video behavior prediction model comprises the following steps:
feature extraction: extracting historical moment features of an observation video in a training set, performing multi-mode feature learning on the historical moment features, and predicting to obtain state features of future moments;
dynamic relational modeling: performing dynamic relational modeling on the historical moment characteristics and the state characteristics of the future moment of the observation video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling feature knowledge and relation knowledge of the complete video into the graph convolution neural network respectively, and performing multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
feature fusion: and fusing the updated graph node characteristics of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
According to the video behavior prediction method provided by the invention, the characteristic extraction process comprises the following steps:
extracting historical moment features of observation videos in a training set, wherein the historical moment features comprise video features of multiple modes;
respectively carrying out sequence context modeling on the video characteristics of each mode in the historical moment characteristics, and mapping the video characteristics of each mode to the same dimension;
and according to the video characteristics of each mode after modeling and unifying dimensionality of the sequence context, predicting to obtain the state characteristics of the future moment.
According to the video behavior prediction method provided by the invention, the video features of the multiple modalities comprise: RGB visual features, optical flow features, and target object features.
According to the video behavior prediction method provided by the invention, the updated graph node features of each modality are fused, and the fusion expression is:

$$y = \sum_{m} \mathrm{softmax}\left(W_m h_m^{l+1} + b_m\right)$$

where $y$ is the final predicted future behavior occurrence probability, $m$ indexes the video feature modalities, $h_m^{l+1}$ is the graph node feature at the $(l+1)$-th time step, $W_m$ is the weight parameter of the behavior classifier for modality $m$, and $b_m$ is its bias parameter.
According to the video behavior prediction method provided by the invention, the dynamic relation modeling process comprises the following steps:
establishing a graph convolution neural network for behavior prediction;
taking the video segment hidden state characteristics in the observed video as graph network nodes, and constructing a node characteristic matrix corresponding to each modal video characteristic;
respectively calculating the dynamic node relation corresponding to each modal video characteristic according to the node characteristic matrix;
and respectively updating the graph node characteristics in the characteristic graph corresponding to the modal video characteristics according to the node characteristic matrix and the dynamic node relation.
According to the video behavior prediction method provided by the invention, the network optimization process comprises the following steps:
acquiring a complete video, wherein the complete video comprises a video historical fragment and a real future fragment;
performing sequence dynamic relation modeling on the complete video, and learning to obtain a teacher model;
distilling feature knowledge and relationship knowledge of the teacher model into the graph convolution neural network respectively;
and learning complementary information among the characteristics of the modal videos by characteristic mutual learning and relation mutual learning to obtain the optimized graph convolution neural network.
In a second aspect, the present invention also provides a video behavior prediction system, including:
the acquisition module is used for acquiring a target video to be predicted;
the behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
In a third aspect, the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements any of the steps of the video behavior prediction method when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of any of the video behavior prediction methods described above.
According to the video behavior prediction method, the video behavior prediction system, the electronic equipment and the storage medium, dynamic relation modeling is carried out on the historical moment characteristics of the target video and the predicted state characteristics of the future moment, the dynamic relation between the historical segments and the future segments in the video can be effectively inferred, multi-modal dynamic relation changes of the historical segments and the future segments in the video can be effectively captured, and finally, future occurrence behaviors of the video can be predicted more accurately through the knowledge distillation optimized graph convolution neural network.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a video behavior prediction method provided by the present invention;
FIG. 2 is a schematic diagram of a training process of a video behavior prediction model;
FIG. 3 is a schematic diagram of the training principle of a video behavior prediction model;
FIG. 4 is a schematic structural architecture diagram of a video behavior prediction system provided by the present invention;
fig. 5 is a schematic structural architecture diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates a video behavior prediction method provided by an embodiment of the present invention, which includes:
s110: acquiring a target video to be predicted;
s120: inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical moment features and predicted state features of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-mode features after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
Referring to fig. 2 and fig. 3, the training process of the video behavior prediction model specifically includes:
s210: and a characteristic extraction step, namely extracting the historical moment characteristics of the observed video in the training set, performing multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain the state characteristics of the future moment.
In this step, multi-modal information of the video is mainly considered, and multi-modal feature learning is performed on the observation videos in the training set. For each modal feature of the observed video, the sequence context dependencies are modeled and the state features at future moments are predicted.
Specifically, first, for each video in the training data set, a convolutional neural network is used to extract the RGB visual features, denoted $\{r_1, r_2, \dots, r_l\}$ with $r_i \in \mathbb{R}^{D_r}$, and the optical flow features, denoted $\{f_1, f_2, \dots, f_l\}$ with $f_i \in \mathbb{R}^{D_f}$. The target object features, denoted $\{o_1, o_2, \dots, o_l\}$ with $o_i \in \mathbb{R}^{D_o}$, are extracted using Fast R-CNN. Here $D_r$, $D_f$ and $D_o$ represent the dimensions of the RGB visual features, optical flow features and target object features respectively, $i$ is the index of the video segment, and the input video contains $l$ segments in total.
Then, three Gated Recurrent Unit (GRU) networks are used to perform sequence context modeling on the RGB visual features $\{r_1, r_2, \dots, r_l\}$, the optical flow features $\{f_1, f_2, \dots, f_l\}$ and the target object features $\{o_1, o_2, \dots, o_l\}$ respectively, while mapping the three kinds of features to a unified dimension $D_h$. Applying this processing operation to the three modal features yields the expressions for the historical moment features:

$$h_i^r = \mathrm{GRU}_r(r_i, h_{i-1}^r), \quad h_i^f = \mathrm{GRU}_f(f_i, h_{i-1}^f), \quad h_i^o = \mathrm{GRU}_o(o_i, h_{i-1}^o)$$

where $h_i^r, h_i^f, h_i^o \in \mathbb{R}^{D_h}$.
Finally, 3 stepping Gated Recurrent Units (PGRUs) are designed to predict the multi-modal video features of the future time node, giving the prediction results:

$$h_{l+1}^r = \mathrm{PGRU}_r(h_l^r), \quad h_{l+1}^f = \mathrm{PGRU}_f(h_l^f), \quad h_{l+1}^o = \mathrm{PGRU}_o(h_l^o)$$

where $h_{l+1}^r$, $h_{l+1}^f$ and $h_{l+1}^o$ are the three modal features at the predicted future moment.
S220: and a dynamic relation modeling step, namely performing dynamic relation modeling on the historical moment characteristics and the state characteristics of the future moment of the observed video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode.
In this step, a Graph Convolutional Network (GCN) is used to perform dynamic relational modeling on the historical moment features of the observed video and the predicted state features of the future moment, and thereby to reason about the behaviors about to occur at the future moment.
Specifically, the operation of the GCN in this embodiment is defined as:

$$Z = \mathrm{ReLU}(A X W)$$

where $X$ is the input matrix formed by arranging all the nodes in the graph network, $A$ is the adjacency matrix describing the relationship between nodes in the graph convolution neural network, $W$ is the network parameter of the GCN, and $\mathrm{ReLU}$ is a nonlinear activation function.
The hidden state features of the modeled video segments are used as graph network nodes, forming node feature matrices for the 3 modalities:

$$X_r = [h_1^r; \dots; h_l^r; h_{l+1}^r], \quad X_f = [h_1^f; \dots; h_l^f; h_{l+1}^f], \quad X_o = [h_1^o; \dots; h_l^o; h_{l+1}^o]$$

Then, the dynamic node relations are calculated from the node features, respectively:

$$A_r = \mathrm{softmax}\left(X_r X_r^{\top}\right), \quad A_f = \mathrm{softmax}\left(X_f X_f^{\top}\right), \quad A_o = \mathrm{softmax}\left(X_o X_o^{\top}\right)$$

where $A_r(i,j)$, $A_f(i,j)$ and $A_o(i,j)$ represent the relationship between the $i$-th and $j$-th nodes in the relationship graphs of the 3 modalities.
Finally, the node features of the feature graphs of the 3 modalities are updated with 3 layers of GCN respectively, giving the updated graph node features:

$$\hat{X}_r = \mathrm{ReLU}(A_r X_r W_r), \quad \hat{X}_f = \mathrm{ReLU}(A_f X_f W_f), \quad \hat{X}_o = \mathrm{ReLU}(A_o X_o W_o)$$

where $\hat{X}_r$, $\hat{X}_f$ and $\hat{X}_o$ are the updated graph node features of the 3 modalities.
S230: network optimization, namely acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling feature knowledge and relation knowledge of the complete video into a convolutional neural network respectively, and performing multi-mode feature mutual learning and relation mutual learning to obtain an optimized convolutional neural network; wherein the complete video comprises historical segments of the video and real future segments.
In this step, the complete video containing the video history segment and the real future segment is used for sequence dynamic relation modeling, a Teacher model is learned, and the relational knowledge of the teacher network is distilled into the graph convolution neural network, i.e. into the Student model. The future node features $h_{l+1}^{r,T}$, $h_{l+1}^{f,T}$ and $h_{l+1}^{o,T}$ in the teacher model are calculated from the real future video segments, rather than predicted by the PGRUs in S210 described above.
Specifically, the present embodiment uses two knowledge distillation strategies, feature distillation and relation distillation, to distill the knowledge of the teacher model into the graph convolution neural network for behavior prediction in S220.
The loss function of the feature distillation strategy is the 2-norm difference between the graph node features obtained in the teacher model and the student model, namely:

$$L_{kd\_fea} = \left\|\hat{X}_r - \hat{X}_r^{T}\right\|_2^2 + \left\|\hat{X}_f - \hat{X}_f^{T}\right\|_2^2 + \left\|\hat{X}_o - \hat{X}_o^{T}\right\|_2^2$$

where $\hat{X}_r$, $\hat{X}_f$ and $\hat{X}_o$ are the graph node features derived by the graph convolution neural network for behavior prediction (i.e. the student model), and $\hat{X}_r^{T}$, $\hat{X}_f^{T}$ and $\hat{X}_o^{T}$ are the graph node features obtained by the teacher model.
The loss function of the relational distillation strategy is the Kullback-Leibler divergence between the graph relationship matrices obtained in the teacher model and the student model, namely:

$$L_{kd\_rel} = D_{KL}\left(A_r^{T}, A_r\right) + D_{KL}\left(A_f^{T}, A_f\right) + D_{KL}\left(A_o^{T}, A_o\right)$$

where $A_r$, $A_f$ and $A_o$ are the graph relationship matrices obtained by the student model, and $A_r^{T}$, $A_f^{T}$ and $A_o^{T}$ are the graph relationship matrices obtained by the teacher model.
Specifically, the Kullback-Leibler divergence is calculated as:

$$D_{KL}(p, q) = \mathbb{E}\left[\log(p) - \log(q)\right] \qquad (16)$$
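The divergence in Eq. (16) can be computed over the relation matrices row by row; the sketch below treats each row as a discrete distribution and sums the per-row divergences, which is one plausible reduction (the patent does not spell out the exact one).

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """D_KL(p, q) = E_p[log p - log q], summed elementwise over two
    row-stochastic matrices; eps guards against log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# toy row-stochastic relation matrices for two models of the same 3-node graph
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
q = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])
d = kl_div(p, q)  # non-negative, and zero only when p equals q
```

Note that $D_{KL}$ is asymmetric, which is why the mutual-learning loss below sums it in both directions for each pair of matrices.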
meanwhile, the complementary information between the three video modes is learned through two multi-mode mutual learning strategies of feature mutual learning and relation mutual learning. The loss function of feature mutual learning is 2-norm difference between graph node features obtained in the step graph convolution neural network, namely:
Figure BDA0003218561840000091
the loss function of the relationship mutual learning is the Kullback-Leibler divergence between the graph relationship matrices obtained in the graph convolution neural network, namely:
Lmu_rel=DKL(Af,Ar)+DKL(Ar,Ao)+DKL(Ar,Af)+DKL(Ar,Af)DKL(Ao,Ar)+DKL(Af,Ar) (18)
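Both mutual-learning losses can be sketched with stand-in features and relation matrices; the random inputs and the row-wise KL reduction are assumptions made for illustration.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    # elementwise KL divergence between two row-stochastic matrices
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(1)
mods = ("rgb", "flow", "obj")
X_hat = {m: rng.normal(size=(6, 8)) for m in mods}  # stand-in node features
A = {}
for m in mods:
    S = np.exp(rng.normal(size=(6, 6)))
    A[m] = S / S.sum(axis=1, keepdims=True)  # row-stochastic relation matrix

pairs = [("rgb", "flow"), ("flow", "obj"), ("obj", "rgb")]
# feature mutual learning: pairwise 2-norm gaps between modal node features
L_mu_fea = sum(np.sum((X_hat[a] - X_hat[b]) ** 2) for a, b in pairs)
# relation mutual learning: symmetrised pairwise KL between relation matrices
L_mu_rel = sum(kl_rows(A[a], A[b]) + kl_rows(A[b], A[a]) for a, b in pairs)
```

Summing both directions of each pair reproduces the six KL terms of Eq. (18).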
through the knowledge distillation and mutual learning process, the graph convolution neural network for behavior prediction can be optimized, and the accuracy and reliability of network processing data are improved.
S240: and a feature fusion step, namely fusing the updated graph node features of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
In this step, the multi-modal prediction results obtained in S220 are fused, learned and optimized under a unified framework, and finally the video behavior prediction result is output.
The multi-modal fusion strategy involved in feature fusion is:

$$y = \sum_{m} \mathrm{softmax}\left(W_m h_m^{l+1} + b_m\right)$$

where $y$ is the final predicted future behavior probability distribution, $m$ indexes the video feature modalities, $h_m^{l+1}$ is the graph node feature at the $(l+1)$-th time step obtained in S220, $W_m$ is the weight parameter of the behavior classifier, and $b_m$ is its bias parameter.
The learning optimization loss function of the unified framework is:

$$L = L_{ce}(y, \hat{y}) + L_{kd\_fea} + L_{kd\_rel} + L_{mu\_fea} + L_{mu\_rel}$$

where $L_{ce}$ is the cross-entropy loss function, $\hat{y}$ is the true future behavior tag, and $L_{kd\_fea}$, $L_{kd\_rel}$, $L_{mu\_fea}$ and $L_{mu\_rel}$ are the loss functions of knowledge distillation and multi-modal mutual learning.
Despite the great success of graph convolution neural networks in dynamic relational modeling, few approaches have applied GCNs to video-based behavior prediction. In order to enable the GCN to effectively model the relationship between the past and future behaviors in a video, and to fully utilize the dynamic association between past and future behaviors learned from complete video segments, the embodiment of the invention fully considers three aspects, namely multi-modal feature learning, global relationship modeling, and relational knowledge distillation from complete video segments, and provides the video behavior prediction method described above.
The video behavior prediction system provided by the present invention is described below, and the video behavior prediction system described below and the video behavior prediction method described above may be referred to in correspondence with each other.
Referring to fig. 4, a video behavior prediction system according to an embodiment of the present invention includes:
an obtaining module 410, configured to obtain a target video to be predicted;
the behavior prediction module 420 is configured to input the target video to the video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical moment features and predicted state features of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-mode features after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
The behavior prediction module 420 enables prediction of future behavior in video through a video behavior prediction model, and in particular, with respect to a training portion of the video behavior prediction model, includes:
the characteristic extraction unit is used for extracting the historical moment characteristics of the observed videos in the training set, performing multi-mode characteristic learning on the historical moment characteristics and predicting to obtain the state characteristics of the future moment;
the dynamic relation modeling unit is used for carrying out dynamic relation modeling on the historical moment characteristics and the state characteristics of the future moment of the observation video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
the network optimization unit is used for acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling the feature knowledge and the relation knowledge of the complete video into the atlas neural network respectively, and performing multi-mode feature mutual learning and relation mutual learning to obtain an optimized atlas neural network; wherein, the complete video comprises a video history segment and a real future segment;
and the feature fusion unit is used for fusing the updated graph node features of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
It can be understood that, the feature extraction unit first needs to extract the historical time features of the observed videos in the training set, where the historical time features include video features of multiple modalities, and in this embodiment, features of three modalities, namely RGB visual features, optical flow features, and target object features, are used; then, performing sequence context modeling on the video features of each modality in the historical moment features respectively, and mapping the video features of each modality to the same dimension; and finally, according to the modal video characteristics after the sequence context modeling and dimension unification, predicting to obtain the state characteristics of the future moment.
It can be understood that the dynamic relationship modeling unit first needs to establish a graph convolution neural network for behavior prediction; then, the video segment hidden state characteristics in the observed video are used as graph network nodes, and a node characteristic matrix corresponding to each modal video characteristic is constructed; then, respectively calculating the dynamic node relation corresponding to each modal video characteristic according to the node characteristic matrix; and finally, respectively updating the graph node characteristics in the characteristic graph corresponding to the modal video characteristics according to the node characteristic matrix and the dynamic node relation.
It can be understood that the network optimization unit first acquires a complete video containing a historical segment of the video and a real future segment; then, performing sequence dynamic relation modeling on the complete video, and learning to obtain a teacher model; distilling the characteristic knowledge and the relation knowledge of the teacher model into a graph convolution neural network respectively; and finally, learning complementary information among the modal video features through feature mutual learning and relation mutual learning to obtain the optimized graph convolution neural network.
According to the video behavior prediction system provided by the embodiment of the invention, dynamic relation modeling is carried out on the historical moment characteristics of the target video and the predicted state characteristics of the future moment through the behavior prediction module, the dynamic relation between the historical segments and the future segments in the video can be effectively inferred, the multi-modal dynamic relation change of the historical segments and the future segments in the video can be effectively captured, and finally, the future occurrence behaviors of the video can be more accurately predicted through the graph convolution neural network optimized through knowledge distillation.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a video behavior prediction method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for performing dynamic relational modeling on historical moment features and predicted state features of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-mode features after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied as a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to execute the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; and inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; wherein the video behavior prediction model performs dynamic relation modeling on the historical-moment features and predicted state features of the target video through a graph convolution neural network, optimizes the graph convolution neural network through knowledge distillation, and fuses the multi-modal features after dynamic relation modeling based on the optimized graph convolution neural network to obtain the video behavior prediction result.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; and inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; wherein the video behavior prediction model performs dynamic relation modeling on the historical-moment features and predicted state features of the target video through a graph convolution neural network, optimizes the graph convolution neural network through knowledge distillation, and fuses the multi-modal features after dynamic relation modeling based on the optimized graph convolution neural network to obtain the video behavior prediction result.
The above-described apparatus embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for video behavior prediction, comprising:
acquiring a target video to be predicted;
inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
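The overall flow of claim 1 — acquire a target video, run it through the prediction model, and obtain a behavior prediction result — can be sketched as follows. This is a hypothetical NumPy illustration: the feature extractors, the relational modeling step (here reduced to temporal pooling), and all names are stand-ins, not the patented model:

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_multimodal_features(video_frames):
    # Hypothetical stand-in for RGB, optical-flow and object feature extractors:
    # each modality yields one 16-dim feature per frame.
    t = len(video_frames)
    return {m: rng.normal(size=(t, 16)) for m in ("rgb", "flow", "obj")}

def predict_behavior(video_frames, classifier_weights):
    # 1. extract per-modality historical features
    feats = extract_multimodal_features(video_frames)
    # 2. pool each modality over time (placeholder for dynamic relation modeling)
    pooled = {m: f.mean(axis=0) for m, f in feats.items()}
    # 3. fuse modality scores and return a behavior probability distribution
    logits = sum(classifier_weights[m] @ pooled[m] for m in pooled)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# 8 dummy frames, 5 hypothetical behavior classes.
weights = {m: rng.normal(size=(5, 16)) for m in ("rgb", "flow", "obj")}
probs = predict_behavior([None] * 8, weights)
```

The output is a probability distribution over behavior classes; the claimed model replaces steps 2 and 3 with graph-based relation modeling and the distillation-optimized fusion described in the dependent claims.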
2. The method according to claim 1, wherein the training process of the video behavior prediction model includes:
feature extraction: extracting historical moment features of an observation video in a training set, performing multi-modal feature learning on the historical moment features, and predicting state features of future moments;
dynamic relational modeling: performing dynamic relational modeling on the historical moment characteristics and the state characteristics of the future moment of the observation video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling feature knowledge and relation knowledge of the complete video into the graph convolution neural network respectively, and performing multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
feature fusion: and fusing the updated graph node characteristics of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
3. The method according to claim 2, wherein the feature extraction process comprises:
extracting historical moment features of the observation videos in a training set, wherein the historical moment features comprise video features of multiple modalities;
performing sequence context modeling on the video features of each modality in the historical moment features, and mapping the video features of each modality to the same dimension; and
predicting state features of future moments according to the video features of each modality after sequence context modeling and dimension unification.
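The three steps of claim 3 can be sketched with a toy recurrent pass in NumPy: each modality's variable-dimension feature sequence is mapped through a simple tanh recurrence into a shared hidden dimension, and the future state is predicted from the last observed hidden state. The recurrence, the dimensions, and all parameter names are hypothetical stand-ins for whatever sequence model the patent actually uses:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 32  # shared hidden dimension that all modalities are mapped to

def sequence_context(features, W_in, W_h):
    # Map a (T, d_m) modality sequence to (T, D) hidden states in the shared
    # dimension; a plain tanh recurrence stands in for the real sequence model.
    h = np.zeros(D)
    states = []
    for x in features:
        h = np.tanh(W_in @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

def predict_future_state(states, W_pred):
    # Predict the next (future-moment) hidden state from the last observed one.
    return np.tanh(W_pred @ states[-1])

# Three modalities with different raw feature dimensions, 10 observed steps.
dims = {"rgb": 64, "flow": 48, "obj": 24}
history = {m: rng.normal(size=(10, d)) for m, d in dims.items()}
params = {m: (rng.normal(size=(D, d)) * 0.1, rng.normal(size=(D, D)) * 0.1)
          for m, d in dims.items()}
W_pred = rng.normal(size=(D, D)) * 0.1

states = {m: sequence_context(history[m], *params[m]) for m in dims}
future = {m: predict_future_state(states[m], W_pred) for m in dims}
```

After this step every modality lives in the same D-dimensional space, which is what allows the observed and predicted states to be treated as nodes of one graph per modality in the dynamic relation modeling of claim 6.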
4. The method according to claim 3, wherein the video features of the plurality of modalities comprise: RGB visual features, optical flow features, and target object features.
5. The method according to claim 4, wherein the updated graph node features of the respective modalities are fused, and the fusion expression is of the form:

y = Σ_m softmax(W_m · H_m^(l+1) + b_m)

wherein y is the final predicted probability of future behavior occurrence, m is the video feature modality, H_m^(l+1) is the graph node feature at the (l+1)-th layer, W_m is the weight parameter of the behavior classifier, and b_m is the bias parameter of the behavior classifier.
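A minimal NumPy sketch of a fusion rule consistent with the terms defined in claim 5 — per-modality classifier outputs combined into a single behavior distribution — is given below. The exact combination (here an average of per-modality softmax scores) is an assumption for illustration; the patent's actual fusion expression may differ:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_modalities(node_feats, W, b):
    # Each modality m has updated graph node features H_m, a classifier weight
    # W_m and bias b_m; the per-modality class scores are averaged into y.
    scores = [softmax(W[m] @ node_feats[m] + b[m]) for m in node_feats]
    return np.mean(scores, axis=0)

rng = np.random.default_rng(3)
C, D = 6, 16  # 6 hypothetical behavior classes, 16-dim node features
feats = {m: rng.normal(size=D) for m in ("rgb", "flow", "obj")}
W = {m: rng.normal(size=(C, D)) for m in feats}
b = {m: rng.normal(size=C) for m in feats}
y = fuse_modalities(feats, W, b)
```

Averaging normalized per-modality scores keeps y a valid probability distribution regardless of how many modalities contribute.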
6. The method according to claim 2, wherein the dynamic relationship modeling process comprises:
establishing a graph convolution neural network for behavior prediction;
taking the video segment hidden state characteristics in the observed video as graph network nodes, and constructing a node characteristic matrix corresponding to each modal video characteristic;
respectively calculating the dynamic node relation corresponding to each modal video characteristic according to the node characteristic matrix;
and updating, according to the node feature matrix and the dynamic node relation, the graph node features in the feature graph corresponding to each modal video feature.
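The steps of claim 6 can be sketched as one graph-convolution update in NumPy: the adjacency (the "dynamic node relation") is recomputed from the current node feature matrix, so the graph changes as the features change, and nodes are then updated by neighbour aggregation. The inner-product-softmax adjacency and ReLU update below are illustrative assumptions, not the patented formulation:

```python
import numpy as np

def dynamic_adjacency(H):
    # Dynamic node relation computed from the node feature matrix itself:
    # row-softmax over pairwise inner products, so edge weights follow the
    # current features rather than a fixed graph.
    S = H @ H.T
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_update(H, W):
    # One graph-convolution step: aggregate neighbours, transform, ReLU.
    A = dynamic_adjacency(H)
    return np.maximum(A @ H @ W, 0.0)

rng = np.random.default_rng(4)
H = rng.normal(size=(6, 12))       # 6 segment nodes (history + predicted future)
W = rng.normal(size=(12, 12)) * 0.1
H_next = gcn_update(H, W)
```

One such graph (and one such update) would be maintained per modality, matching the per-modality node feature matrices of the claim.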
7. The method according to claim 2, wherein the network optimization process comprises:
acquiring a complete video, wherein the complete video comprises a video historical fragment and a real future fragment;
performing sequence dynamic relation modeling on the complete video, and learning to obtain a teacher model;
distilling feature knowledge and relationship knowledge of the teacher model into the graph convolution neural network respectively;
and learning complementary information among the characteristics of the modal videos by characteristic mutual learning and relation mutual learning to obtain the optimized graph convolution neural network.
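The mutual-learning step of claim 7 — each modality branch learning complementary information from the others — can be sketched as a symmetric KL term between the per-modality predictions. The pairwise-KL form below is a common mutual-learning formulation used here as an assumption; the patent does not specify this exact loss:

```python
import numpy as np

def kl(p, q, eps=1e-8):
    # KL divergence between two discrete distributions, stabilized by eps.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mutual_learning_loss(probs):
    # Each modality's prediction is pulled toward every other modality's,
    # so complementary information flows between the modal branches.
    mods = list(probs)
    total = 0.0
    for a in mods:
        for b in mods:
            if a != b:
                total += kl(probs[a], probs[b])
    return total / (len(mods) * (len(mods) - 1))

rng = np.random.default_rng(5)
raw = {m: rng.random(4) for m in ("rgb", "flow", "obj")}
probs = {m: v / v.sum() for m, v in raw.items()}
loss = mutual_learning_loss(probs)
```

The loss is zero exactly when all modality branches agree, so minimizing it alongside the distillation terms encourages the branches to exchange complementary evidence rather than diverge.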
8. A video behavior prediction system, comprising:
the acquisition module is used for acquiring a target video to be predicted;
the behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the video behavior prediction method according to any of claims 1 to 7.
10. A non-transitory computer readable storage medium, having stored thereon a computer program, which, when being executed by a processor, carries out the steps of the video behavior prediction method according to any one of claims 1 to 7.
CN202110950812.2A 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium Pending CN113705402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110950812.2A CN113705402A (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113705402A (en) 2021-11-26

Family

ID=78653369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110950812.2A Pending CN113705402A (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113705402A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079601A (en) * 2019-12-06 2020-04-28 中国科学院自动化研究所 Video content description method, system and device based on multi-mode attention mechanism
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111488815A (en) * 2020-04-07 2020-08-04 中山大学 Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network
CN111860353A (en) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 Video behavior prediction method, device and medium based on double-flow neural network
US20200394499A1 (en) * 2019-06-12 2020-12-17 Sri International Identifying complex events from hierarchical representation of data set features
CN112183391A (en) * 2020-09-30 2021-01-05 中国科学院计算技术研究所 First-view video behavior prediction system and method
CN112541440A (en) * 2020-12-16 2021-03-23 中电海康集团有限公司 Subway pedestrian flow network fusion method based on video pedestrian recognition and pedestrian flow prediction method
CN112668438A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Infrared video time sequence behavior positioning method, device, equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIDING YANG et al., "Distilling Knowledge from Graph Convolutional Networks", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5 August 2020 (2020-08-05) *
YU Haitao et al., "Adversarial Video Generation Method Based on Multimodal Input", Journal of Computer Research and Development, 31 July 2020 (2020-07-31) *
Statistical Machine Intelligence and Learning Laboratory, University of Electronic Science and Technology of China: "Introduction: What Is a Graph Convolutional Network? A Rising Star in Action Recognition", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/68690795> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142552A1 (en) * 2022-01-27 2023-08-03 苏州大学 Action prediction method for unknown category
CN114310917A (en) * 2022-03-11 2022-04-12 山东高原油气装备有限公司 Joint track error compensation method for oil pipe transfer robot
CN114310917B (en) * 2022-03-11 2022-06-14 山东高原油气装备有限公司 Oil pipe transfer robot joint track error compensation method

Similar Documents

Publication Publication Date Title
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN112633010A (en) Multi-head attention and graph convolution network-based aspect-level emotion analysis method and system
CN113762052A (en) Video cover extraction method, device, equipment and computer readable storage medium
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN113705402A (en) Video behavior prediction method, system, electronic device and storage medium
CN111737432A (en) Automatic dialogue method and system based on joint training model
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN114580794B (en) Data processing method, apparatus, program product, computer device and medium
CN116664719A (en) Image redrawing model training method, image redrawing method and device
CN113641797A (en) Data processing method, device, equipment, storage medium and computer program product
CN114647713A (en) Knowledge graph question-answering method, device and storage medium based on virtual confrontation
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN113987236B (en) Unsupervised training method and unsupervised training device for visual retrieval model based on graph convolution network
CN114398973B (en) Media content tag identification method, device, equipment and storage medium
CN117690098B (en) Multi-label identification method based on dynamic graph convolution under open driving scene
CN114818707A (en) Automatic driving decision method and system based on knowledge graph
CN115374927A (en) Neural network model training method, anomaly detection method and device
CN114627085A (en) Target image identification method and device, storage medium and electronic equipment
CN111935259B (en) Method and device for determining target account set, storage medium and electronic equipment
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium
CN111611981A (en) Information identification method and device and information identification neural network training method and device
CN116702784B (en) Entity linking method, entity linking device, computer equipment and storage medium
CN113283394B (en) Pedestrian re-identification method and system integrating context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination