CN113705402A - Video behavior prediction method, system, electronic device and storage medium - Google Patents

Video behavior prediction method, system, electronic device and storage medium Download PDF

Info

Publication number
CN113705402A
CN113705402A (application CN202110950812.2A)
Authority
CN
China
Prior art keywords
video
behavior prediction
neural network
convolution neural
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110950812.2A
Other languages
Chinese (zh)
Inventor
徐常胜
杨小汕
黄毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110950812.2A priority Critical patent/CN113705402A/en
Publication of CN113705402A publication Critical patent/CN113705402A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention provides a video behavior prediction method, system, electronic device and storage medium. The method comprises: acquiring a target video to be predicted; and inputting the target video into a video behavior prediction model to obtain the behavior prediction result output by the model. The video behavior prediction model performs dynamic relational modeling on the historical moment features of the target video and its predicted state features at the future moment through a graph convolution neural network, optimizes the graph convolution neural network through knowledge distillation, and fuses the multi-modal features after dynamic relational modeling based on the optimized network to obtain the video behavior prediction result. The method, system, electronic device and storage medium can effectively capture the multi-modal dynamic relation changes between historical segments and future segments in videos, and can predict the future behaviors of a video more accurately through the graph convolution neural network optimized by knowledge distillation.

Description

Video behavior prediction method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of video analysis technologies, and in particular, to a method, a system, an electronic device, and a storage medium for video behavior prediction.
Background
With the rapid development of the computer and internet of things technologies, the future behavior prediction technology based on video has increasingly wide practical application scenes in the fields of automatic driving, man-machine interaction, wearable device assistants and the like.
In the conventional video behavior prediction method, after context modeling is performed on an observed video, a hidden state representation of the observed video is directly used for generating future behavior characteristics, so that behavior prediction is realized. However, this way of directly predicting future behavior based on past video segments ignores the strong correlation that potentially exists between past and future behavior.
In addition, in the training stage of the model, the traditional video behavior prediction method does not consider using a training sample containing a future video segment, so that the learning of the associated knowledge between the past behavior and the future behavior is insufficient, and the obtained behavior prediction result is not accurate and reliable.
Disclosure of Invention
The invention provides a video behavior prediction method, a video behavior prediction system, electronic equipment and a storage medium, which are used for solving the technical problem that video behavior prediction is not accurate and reliable enough in the prior art.
In a first aspect, the present invention provides a video behavior prediction method, including:
acquiring a target video to be predicted;
inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
According to the video behavior prediction method provided by the invention, the training process of the video behavior prediction model comprises the following steps:
feature extraction: extracting historical moment features of an observation video in a training set, performing multi-mode feature learning on the historical moment features, and predicting to obtain state features of future moments;
dynamic relational modeling: performing dynamic relational modeling on the historical moment characteristics and the state characteristics of the future moment of the observation video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling feature knowledge and relation knowledge of the complete video into the graph convolution neural network respectively, and performing multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
feature fusion: and fusing the updated graph node characteristics of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
According to the video behavior prediction method provided by the invention, the characteristic extraction process comprises the following steps:
extracting historical moment features of observation videos in a training set, wherein the historical moment features comprise video features of multiple modes;
respectively carrying out sequence context modeling on the video characteristics of each mode in the historical moment characteristics, and mapping the video characteristics of each mode to the same dimension;
and according to the video characteristics of each mode after modeling and unifying dimensionality of the sequence context, predicting to obtain the state characteristics of the future moment.
According to the video behavior prediction method provided by the invention, the video features of the multiple modalities comprise: RGB visual features, optical flow features, and target object features.
According to the video behavior prediction method provided by the invention, the updated graph node features of each modality are fused, and the fusion expression is:

$$y = \sum_{m} \mathrm{softmax}\left(W_m h_m^{l+1} + b_m\right)$$

where $y$ is the final predicted future behavior occurrence probability, $m$ indexes the video feature modalities, $h_m^{l+1}$ is the graph node feature at the $(l+1)$-th time step, $W_m$ is the weight parameter of the behavior classifier for modality $m$, and $b_m$ is its bias parameter.
According to the video behavior prediction method provided by the invention, the dynamic relation modeling process comprises the following steps:
establishing a graph convolution neural network for behavior prediction;
taking the video segment hidden state characteristics in the observed video as graph network nodes, and constructing a node characteristic matrix corresponding to each modal video characteristic;
respectively calculating the dynamic node relation corresponding to each modal video characteristic according to the node characteristic matrix;
and respectively updating the graph node characteristics in the characteristic graph corresponding to the modal video characteristics according to the node characteristic matrix and the dynamic node relation.
According to the video behavior prediction method provided by the invention, the network optimization process comprises the following steps:
acquiring a complete video, wherein the complete video comprises a video historical fragment and a real future fragment;
performing sequence dynamic relation modeling on the complete video, and learning to obtain a teacher model;
distilling feature knowledge and relationship knowledge of the teacher model into the graph convolution neural network respectively;
and learning complementary information among the characteristics of the modal videos by characteristic mutual learning and relation mutual learning to obtain the optimized graph convolution neural network.
In a second aspect, the present invention also provides a video behavior prediction system, including:
the acquisition module is used for acquiring a target video to be predicted;
the behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
In a third aspect, the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements any of the steps of the video behavior prediction method when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of any of the video behavior prediction methods described above.
According to the video behavior prediction method, the video behavior prediction system, the electronic equipment and the storage medium, dynamic relation modeling is carried out on the historical moment characteristics of the target video and the predicted state characteristics of the future moment, the dynamic relation between the historical segments and the future segments in the video can be effectively inferred, multi-modal dynamic relation changes of the historical segments and the future segments in the video can be effectively captured, and finally, future occurrence behaviors of the video can be predicted more accurately through the knowledge distillation optimized graph convolution neural network.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a video behavior prediction method provided by the present invention;
FIG. 2 is a schematic diagram of a training process of a video behavior prediction model;
FIG. 3 is a schematic diagram of the training principle of a video behavior prediction model;
FIG. 4 is a schematic structural architecture diagram of a video behavior prediction system provided by the present invention;
fig. 5 is a schematic structural architecture diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates a video behavior prediction method provided by an embodiment of the present invention, which includes:
s110: acquiring a target video to be predicted;
s120: inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical moment features and predicted state features of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-mode features after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
Referring to fig. 2 and fig. 3, the training process of the video behavior prediction model specifically includes:
s210: and a characteristic extraction step, namely extracting the historical moment characteristics of the observed video in the training set, performing multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain the state characteristics of the future moment.
In this step, multi-modal information of the video is mainly considered, and multi-modal feature learning is performed on the observation videos in the training set. For each modal feature of the observed video, the sequence context dependencies are modeled and the state features at future moments are predicted.
Specifically, first, for each video in the training data set, a convolutional neural network is used to extract the RGB visual features, denoted $\{r_1, r_2, \dots, r_l\}$ with $r_i \in \mathbb{R}^{D_r}$, and the optical flow features, denoted $\{f_1, f_2, \dots, f_l\}$ with $f_i \in \mathbb{R}^{D_f}$. The target object features, denoted $\{o_1, o_2, \dots, o_l\}$ with $o_i \in \mathbb{R}^{D_o}$, are extracted using Fast R-CNN. Here $D_r$, $D_f$ and $D_o$ represent the dimensions of the RGB visual features, optical flow features and target object features respectively, $i$ is the index of the video segment, and the input video contains $l$ segments in total.
Then, three Gated Recurrent Unit (GRU) networks are used to perform sequence context modeling on the RGB visual features $\{r_1, r_2, \dots, r_l\}$, the optical flow features $\{f_1, f_2, \dots, f_l\}$ and the target object features $\{o_1, o_2, \dots, o_l\}$ respectively, while mapping the three kinds of features to a unified dimension $D_h$. Applying this processing operation to the three modal features yields the expressions for the historical moment features:

$$h_i^r = \mathrm{GRU}_r(r_i, h_{i-1}^r), \quad h_i^f = \mathrm{GRU}_f(f_i, h_{i-1}^f), \quad h_i^o = \mathrm{GRU}_o(o_i, h_{i-1}^o)$$

where $h_i^r, h_i^f, h_i^o \in \mathbb{R}^{D_h}$.
Finally, 3 stepping Gated Recurrent Units (PGRUs) are designed to predict the multi-modal video features of the future time node, giving the prediction results:

$$h_{l+1}^r = \mathrm{PGRU}_r(h_l^r), \quad h_{l+1}^f = \mathrm{PGRU}_f(h_l^f), \quad h_{l+1}^o = \mathrm{PGRU}_o(h_l^o)$$

where $h_{l+1}^r$, $h_{l+1}^f$ and $h_{l+1}^o$ are the three modal features at the predicted future moment.
S220: and a dynamic relation modeling step, namely performing dynamic relation modeling on the historical moment characteristics and the state characteristics of the future moment of the observed video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode.
In this step, a Graph Convolutional Network (GCN) is used to perform dynamic relational modeling on the historical moment features of the observed video and the predicted state features of the future moment, and thereby to reason about the behaviors about to occur at the future moment.
Specifically, the operation of the GCN in this embodiment is defined as:

$$Z = \mathrm{ReLU}(A X W)$$

where $X$ is the input matrix formed by arranging all the nodes in the graph network, $A$ is the adjacency matrix describing the relationship between nodes in the graph convolution neural network, $W$ is the network parameter of the GCN, and $\mathrm{ReLU}$ is a nonlinear activation function.
The hidden state features of the modeled video segments are used as graph network nodes, forming node feature matrices for the 3 modalities:

$$X_r = [h_1^r; \dots; h_l^r; h_{l+1}^r], \quad X_f = [h_1^f; \dots; h_l^f; h_{l+1}^f], \quad X_o = [h_1^o; \dots; h_l^o; h_{l+1}^o]$$

Then, the dynamic node relations are calculated from the node features, respectively:

$$A_r = \mathrm{softmax}\left(X_r X_r^{\top}\right), \quad A_f = \mathrm{softmax}\left(X_f X_f^{\top}\right), \quad A_o = \mathrm{softmax}\left(X_o X_o^{\top}\right)$$

where $A_r(i,j)$, $A_f(i,j)$ and $A_o(i,j)$ represent the relationship between the $i$-th and $j$-th nodes in the relationship graphs of the 3 modalities.
Finally, the node features of the feature graphs of the 3 modalities are updated with 3 layers of GCN respectively, giving the updated graph node features:

$$\hat{X}_r = \mathrm{ReLU}(A_r X_r W_r), \quad \hat{X}_f = \mathrm{ReLU}(A_f X_f W_f), \quad \hat{X}_o = \mathrm{ReLU}(A_o X_o W_o)$$

where $\hat{X}_r$, $\hat{X}_f$ and $\hat{X}_o$ are the updated graph node features of the 3 modalities.
S230: network optimization, namely acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling feature knowledge and relation knowledge of the complete video into a convolutional neural network respectively, and performing multi-mode feature mutual learning and relation mutual learning to obtain an optimized convolutional neural network; wherein the complete video comprises historical segments of the video and real future segments.
In this step, the complete video containing the video history segment and the real future segment is used for sequence dynamic relation modeling, a Teacher model is learned, and the relational knowledge of the teacher network is distilled into the graph convolution neural network, i.e. into the Student model. The future node features $h_{l+1}^{r,T}$, $h_{l+1}^{f,T}$ and $h_{l+1}^{o,T}$ in the teacher model are calculated from the real future video segments, rather than predicted by the PGRUs in S210 described above.
Specifically, the present embodiment uses two knowledge distillation strategies, feature distillation and relation distillation, to distill the knowledge of the teacher model into the graph convolution neural network for behavior prediction in S220.
The loss function of the feature distillation strategy is the 2-norm difference between the graph node features obtained in the teacher model and the student model, namely:

$$L_{kd\_fea} = \left\|\hat{X}_r - \hat{X}_r^{T}\right\|_2^2 + \left\|\hat{X}_f - \hat{X}_f^{T}\right\|_2^2 + \left\|\hat{X}_o - \hat{X}_o^{T}\right\|_2^2$$

where $\hat{X}_r$, $\hat{X}_f$ and $\hat{X}_o$ are the graph node features derived by the graph convolution neural network for behavior prediction (i.e. the student model), and $\hat{X}_r^{T}$, $\hat{X}_f^{T}$ and $\hat{X}_o^{T}$ are the graph node features obtained by the teacher model.
The loss function of the relational distillation strategy is the Kullback-Leibler divergence between the graph relationship matrices obtained in the teacher model and the student model, namely:

$$L_{kd\_rel} = D_{KL}\left(A_r^{T}, A_r\right) + D_{KL}\left(A_f^{T}, A_f\right) + D_{KL}\left(A_o^{T}, A_o\right)$$

where $A_r$, $A_f$ and $A_o$ are the graph relationship matrices obtained by the student model, and $A_r^{T}$, $A_f^{T}$ and $A_o^{T}$ are the graph relationship matrices obtained by the teacher model.
Specifically, the Kullback-Leibler divergence is calculated as:

$$D_{KL}(p, q) = \mathbb{E}\left[\log(p) - \log(q)\right] \qquad (16)$$
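The divergence in Eq. (16) can be computed over the relation matrices row by row; the sketch below treats each row as a discrete distribution and sums the per-row divergences, which is one plausible reduction (the patent does not spell out the exact one).

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """D_KL(p, q) = E_p[log p - log q], summed elementwise over two
    row-stochastic matrices; eps guards against log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# toy row-stochastic relation matrices for two models of the same 3-node graph
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
q = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])
d = kl_div(p, q)  # non-negative, and zero only when p equals q
```

Note that $D_{KL}$ is asymmetric, which is why the mutual-learning loss below sums it in both directions for each pair of matrices.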
meanwhile, the complementary information between the three video modes is learned through two multi-mode mutual learning strategies of feature mutual learning and relation mutual learning. The loss function of feature mutual learning is 2-norm difference between graph node features obtained in the step graph convolution neural network, namely:
Figure BDA0003218561840000091
the loss function of the relationship mutual learning is the Kullback-Leibler divergence between the graph relationship matrices obtained in the graph convolution neural network, namely:
Lmu_rel=DKL(Af,Ar)+DKL(Ar,Ao)+DKL(Ar,Af)+DKL(Ar,Af)DKL(Ao,Ar)+DKL(Af,Ar) (18)
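Both mutual-learning losses can be sketched with stand-in features and relation matrices; the random inputs and the row-wise KL reduction are assumptions made for illustration.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    # elementwise KL divergence between two row-stochastic matrices
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(1)
mods = ("rgb", "flow", "obj")
X_hat = {m: rng.normal(size=(6, 8)) for m in mods}  # stand-in node features
A = {}
for m in mods:
    S = np.exp(rng.normal(size=(6, 6)))
    A[m] = S / S.sum(axis=1, keepdims=True)  # row-stochastic relation matrix

pairs = [("rgb", "flow"), ("flow", "obj"), ("obj", "rgb")]
# feature mutual learning: pairwise 2-norm gaps between modal node features
L_mu_fea = sum(np.sum((X_hat[a] - X_hat[b]) ** 2) for a, b in pairs)
# relation mutual learning: symmetrised pairwise KL between relation matrices
L_mu_rel = sum(kl_rows(A[a], A[b]) + kl_rows(A[b], A[a]) for a, b in pairs)
```

Summing both directions of each pair reproduces the six KL terms of Eq. (18).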
through the knowledge distillation and mutual learning process, the graph convolution neural network for behavior prediction can be optimized, and the accuracy and reliability of network processing data are improved.
S240: and a feature fusion step, namely fusing the updated graph node features of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
In this step, the multi-modal prediction results obtained in S220 are fused, learned and optimized under a unified framework, and finally the video behavior prediction result is output.
The multi-modal fusion strategy involved in feature fusion is:

$$y = \sum_{m} \mathrm{softmax}\left(W_m h_m^{l+1} + b_m\right)$$

where $y$ is the final predicted future behavior probability distribution, $m$ indexes the video feature modalities, $h_m^{l+1}$ is the graph node feature at the $(l+1)$-th time step obtained in S220, $W_m$ is the weight parameter of the behavior classifier, and $b_m$ is its bias parameter.
The learning optimization loss function of the unified framework is:

$$L = L_{ce}(y, \hat{y}) + L_{kd\_fea} + L_{kd\_rel} + L_{mu\_fea} + L_{mu\_rel}$$

where $L_{ce}$ is the cross-entropy loss function, $\hat{y}$ is the true future behavior tag, and $L_{kd\_fea}$, $L_{kd\_rel}$, $L_{mu\_fea}$ and $L_{mu\_rel}$ are the loss functions of knowledge distillation and multi-modal mutual learning.
Despite the great success of graph convolution neural networks in dynamic relational modeling, few approaches have applied GCNs to video-based behavior prediction. In order to enable the GCN to effectively model the relationship between the past and future behaviors in a video, and to fully utilize the dynamic association between past and future behaviors learned from complete video segments, the embodiment of the invention fully considers three aspects, namely multi-modal feature learning, global relationship modeling, and relational knowledge distillation from complete video segments, and provides the video behavior prediction method described above.
The video behavior prediction system provided by the present invention is described below, and the video behavior prediction system described below and the video behavior prediction method described above may be referred to in correspondence with each other.
Referring to fig. 4, a video behavior prediction system according to an embodiment of the present invention includes:
an obtaining module 410, configured to obtain a target video to be predicted;
the behavior prediction module 420 is configured to input the target video to the video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical moment features and predicted state features of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-mode features after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
The behavior prediction module 420 enables prediction of future behavior in video through a video behavior prediction model, and in particular, with respect to a training portion of the video behavior prediction model, includes:
the characteristic extraction unit is used for extracting the historical moment characteristics of the observed videos in the training set, performing multi-mode characteristic learning on the historical moment characteristics and predicting to obtain the state characteristics of the future moment;
the dynamic relation modeling unit is used for carrying out dynamic relation modeling on the historical moment characteristics and the state characteristics of the future moment of the observation video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
the network optimization unit is used for acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling the feature knowledge and the relation knowledge of the complete video into the atlas neural network respectively, and performing multi-mode feature mutual learning and relation mutual learning to obtain an optimized atlas neural network; wherein, the complete video comprises a video history segment and a real future segment;
and the feature fusion unit is used for fusing the updated graph node features of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
It can be understood that, the feature extraction unit first needs to extract the historical time features of the observed videos in the training set, where the historical time features include video features of multiple modalities, and in this embodiment, features of three modalities, namely RGB visual features, optical flow features, and target object features, are used; then, performing sequence context modeling on the video features of each modality in the historical moment features respectively, and mapping the video features of each modality to the same dimension; and finally, according to the modal video characteristics after the sequence context modeling and dimension unification, predicting to obtain the state characteristics of the future moment.
It can be understood that the dynamic relationship modeling unit first needs to establish a graph convolution neural network for behavior prediction; then, the video segment hidden state characteristics in the observed video are used as graph network nodes, and a node characteristic matrix corresponding to each modal video characteristic is constructed; then, respectively calculating the dynamic node relation corresponding to each modal video characteristic according to the node characteristic matrix; and finally, respectively updating the graph node characteristics in the characteristic graph corresponding to the modal video characteristics according to the node characteristic matrix and the dynamic node relation.
It can be understood that the network optimization unit first acquires a complete video containing a historical segment of the video and a real future segment; then, performing sequence dynamic relation modeling on the complete video, and learning to obtain a teacher model; distilling the characteristic knowledge and the relation knowledge of the teacher model into a graph convolution neural network respectively; and finally, learning complementary information among the modal video features through feature mutual learning and relation mutual learning to obtain the optimized graph convolution neural network.
According to the video behavior prediction system provided by the embodiment of the invention, dynamic relation modeling is carried out on the historical moment characteristics of the target video and the predicted state characteristics of the future moment through the behavior prediction module, the dynamic relation between the historical segments and the future segments in the video can be effectively inferred, the multi-modal dynamic relation change of the historical segments and the future segments in the video can be effectively captured, and finally, the future occurrence behaviors of the video can be more accurately predicted through the graph convolution neural network optimized through knowledge distillation.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a video behavior prediction method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for performing dynamic relational modeling on historical moment features and predicted state features of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-mode features after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied as a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to execute the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; and inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; wherein the video behavior prediction model performs dynamic relation modeling on the historical-moment features and predicted state features of the target video through a graph convolution neural network, optimizes the graph convolution neural network through knowledge distillation, and fuses the multi-modal features after dynamic relation modeling based on the optimized graph convolution neural network to obtain the video behavior prediction result.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; and inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; wherein the video behavior prediction model performs dynamic relation modeling on the historical-moment features and predicted state features of the target video through a graph convolution neural network, optimizes the graph convolution neural network through knowledge distillation, and fuses the multi-modal features after dynamic relation modeling based on the optimized graph convolution neural network to obtain the video behavior prediction result.
The above-described apparatus embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for video behavior prediction, comprising:
acquiring a target video to be predicted;
inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
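The overall flow of claim 1 — acquire a target video, run it through the prediction model, and obtain a behavior prediction result — can be sketched as follows. This is a hypothetical NumPy illustration: the feature extractors, the relational modeling step (here reduced to temporal pooling), and all names are stand-ins, not the patented model:

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_multimodal_features(video_frames):
    # Hypothetical stand-in for RGB, optical-flow and object feature extractors:
    # each modality yields one 16-dim feature per frame.
    t = len(video_frames)
    return {m: rng.normal(size=(t, 16)) for m in ("rgb", "flow", "obj")}

def predict_behavior(video_frames, classifier_weights):
    # 1. extract per-modality historical features
    feats = extract_multimodal_features(video_frames)
    # 2. pool each modality over time (placeholder for dynamic relation modeling)
    pooled = {m: f.mean(axis=0) for m, f in feats.items()}
    # 3. fuse modality scores and return a behavior probability distribution
    logits = sum(classifier_weights[m] @ pooled[m] for m in pooled)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# 8 dummy frames, 5 hypothetical behavior classes.
weights = {m: rng.normal(size=(5, 16)) for m in ("rgb", "flow", "obj")}
probs = predict_behavior([None] * 8, weights)
```

The output is a probability distribution over behavior classes; the claimed model replaces steps 2 and 3 with graph-based relation modeling and the distillation-optimized fusion described in the dependent claims.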
2. The method according to claim 1, wherein the training process of the video behavior prediction model includes:
feature extraction: extracting historical moment features of an observation video in a training set, performing multi-modal feature learning on the historical moment features, and predicting state features of future moments;
dynamic relational modeling: performing dynamic relational modeling on the historical moment characteristics and the state characteristics of the future moment of the observation video through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: acquiring a complete video, performing sequence dynamic relation modeling on the complete video, distilling feature knowledge and relation knowledge of the complete video into the graph convolution neural network respectively, and performing multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
feature fusion: and fusing the updated graph node characteristics of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
3. The method according to claim 2, wherein the feature extraction process comprises:
extracting historical moment features of the observation videos in a training set, wherein the historical moment features comprise video features of multiple modalities;
performing sequence context modeling on the video features of each modality in the historical moment features, and mapping the video features of each modality to the same dimension; and
predicting state features of future moments according to the video features of each modality after sequence context modeling and dimension unification.
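The three steps of claim 3 can be sketched with a toy recurrent pass in NumPy: each modality's variable-dimension feature sequence is mapped through a simple tanh recurrence into a shared hidden dimension, and the future state is predicted from the last observed hidden state. The recurrence, the dimensions, and all parameter names are hypothetical stand-ins for whatever sequence model the patent actually uses:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 32  # shared hidden dimension that all modalities are mapped to

def sequence_context(features, W_in, W_h):
    # Map a (T, d_m) modality sequence to (T, D) hidden states in the shared
    # dimension; a plain tanh recurrence stands in for the real sequence model.
    h = np.zeros(D)
    states = []
    for x in features:
        h = np.tanh(W_in @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

def predict_future_state(states, W_pred):
    # Predict the next (future-moment) hidden state from the last observed one.
    return np.tanh(W_pred @ states[-1])

# Three modalities with different raw feature dimensions, 10 observed steps.
dims = {"rgb": 64, "flow": 48, "obj": 24}
history = {m: rng.normal(size=(10, d)) for m, d in dims.items()}
params = {m: (rng.normal(size=(D, d)) * 0.1, rng.normal(size=(D, D)) * 0.1)
          for m, d in dims.items()}
W_pred = rng.normal(size=(D, D)) * 0.1

states = {m: sequence_context(history[m], *params[m]) for m in dims}
future = {m: predict_future_state(states[m], W_pred) for m in dims}
```

After this step every modality lives in the same D-dimensional space, which is what allows the observed and predicted states to be treated as nodes of one graph per modality in the dynamic relation modeling of claim 6.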
4. The method according to claim 3, wherein the video features of the plurality of modalities comprise: RGB visual features, optical flow features, and target object features.
5. The method according to claim 4, wherein the updated graph node features of the respective modalities are fused, and the fusion expression is of the form:

y = Σ_m softmax(W_m · H_m^(l+1) + b_m)

wherein y is the final predicted probability of future behavior occurrence, m is the video feature modality, H_m^(l+1) is the graph node feature at the (l+1)-th layer, W_m is the weight parameter of the behavior classifier, and b_m is the bias parameter of the behavior classifier.
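A minimal NumPy sketch of a fusion rule consistent with the terms defined in claim 5 — per-modality classifier outputs combined into a single behavior distribution — is given below. The exact combination (here an average of per-modality softmax scores) is an assumption for illustration; the patent's actual fusion expression may differ:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_modalities(node_feats, W, b):
    # Each modality m has updated graph node features H_m, a classifier weight
    # W_m and bias b_m; the per-modality class scores are averaged into y.
    scores = [softmax(W[m] @ node_feats[m] + b[m]) for m in node_feats]
    return np.mean(scores, axis=0)

rng = np.random.default_rng(3)
C, D = 6, 16  # 6 hypothetical behavior classes, 16-dim node features
feats = {m: rng.normal(size=D) for m in ("rgb", "flow", "obj")}
W = {m: rng.normal(size=(C, D)) for m in feats}
b = {m: rng.normal(size=C) for m in feats}
y = fuse_modalities(feats, W, b)
```

Averaging normalized per-modality scores keeps y a valid probability distribution regardless of how many modalities contribute.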
6. The method according to claim 2, wherein the dynamic relationship modeling process comprises:
establishing a graph convolution neural network for behavior prediction;
taking the video segment hidden state characteristics in the observed video as graph network nodes, and constructing a node characteristic matrix corresponding to each modal video characteristic;
respectively calculating the dynamic node relation corresponding to each modal video characteristic according to the node characteristic matrix;
and updating, according to the node feature matrix and the dynamic node relation, the graph node features in the feature graph corresponding to each modal video feature.
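The steps of claim 6 can be sketched as one graph-convolution update in NumPy: the adjacency (the "dynamic node relation") is recomputed from the current node feature matrix, so the graph changes as the features change, and nodes are then updated by neighbour aggregation. The inner-product-softmax adjacency and ReLU update below are illustrative assumptions, not the patented formulation:

```python
import numpy as np

def dynamic_adjacency(H):
    # Dynamic node relation computed from the node feature matrix itself:
    # row-softmax over pairwise inner products, so edge weights follow the
    # current features rather than a fixed graph.
    S = H @ H.T
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_update(H, W):
    # One graph-convolution step: aggregate neighbours, transform, ReLU.
    A = dynamic_adjacency(H)
    return np.maximum(A @ H @ W, 0.0)

rng = np.random.default_rng(4)
H = rng.normal(size=(6, 12))       # 6 segment nodes (history + predicted future)
W = rng.normal(size=(12, 12)) * 0.1
H_next = gcn_update(H, W)
```

One such graph (and one such update) would be maintained per modality, matching the per-modality node feature matrices of the claim.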
7. The method according to claim 2, wherein the network optimization process comprises:
acquiring a complete video, wherein the complete video comprises a video historical fragment and a real future fragment;
performing sequence dynamic relation modeling on the complete video, and learning to obtain a teacher model;
distilling feature knowledge and relationship knowledge of the teacher model into the graph convolution neural network respectively;
and learning complementary information among the characteristics of the modal videos by characteristic mutual learning and relation mutual learning to obtain the optimized graph convolution neural network.
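The mutual-learning step of claim 7 — each modality branch learning complementary information from the others — can be sketched as a symmetric KL term between the per-modality predictions. The pairwise-KL form below is a common mutual-learning formulation used here as an assumption; the patent does not specify this exact loss:

```python
import numpy as np

def kl(p, q, eps=1e-8):
    # KL divergence between two discrete distributions, stabilized by eps.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mutual_learning_loss(probs):
    # Each modality's prediction is pulled toward every other modality's,
    # so complementary information flows between the modal branches.
    mods = list(probs)
    total = 0.0
    for a in mods:
        for b in mods:
            if a != b:
                total += kl(probs[a], probs[b])
    return total / (len(mods) * (len(mods) - 1))

rng = np.random.default_rng(5)
raw = {m: rng.random(4) for m in ("rgb", "flow", "obj")}
probs = {m: v / v.sum() for m, v in raw.items()}
loss = mutual_learning_loss(probs)
```

The loss is zero exactly when all modality branches agree, so minimizing it alongside the distillation terms encourages the branches to exchange complementary evidence rather than diverge.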
8. A video behavior prediction system, comprising:
the acquisition module is used for acquiring a target video to be predicted;
the behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for performing dynamic relational modeling on historical time characteristics and predicted state characteristics of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and fusing multi-modal characteristics after dynamic relational modeling based on the optimized graph convolution neural network to obtain a video behavior prediction result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the video behavior prediction method according to any of claims 1 to 7.
10. A non-transitory computer readable storage medium, having stored thereon a computer program, which, when being executed by a processor, carries out the steps of the video behavior prediction method according to any one of claims 1 to 7.
CN202110950812.2A 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium Pending CN113705402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110950812.2A CN113705402A (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113705402A (en) 2021-11-26

Family

ID=78653369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110950812.2A Pending CN113705402A (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113705402A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079601A (en) * 2019-12-06 2020-04-28 中国科学院自动化研究所 Video content description method, system and device based on multi-mode attention mechanism
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111488815A (en) * 2020-04-07 2020-08-04 中山大学 Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network
CN111860353A (en) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 Video behavior prediction method, device and medium based on double-flow neural network
US20200394499A1 (en) * 2019-06-12 2020-12-17 Sri International Identifying complex events from hierarchical representation of data set features
CN112183391A (en) * 2020-09-30 2021-01-05 中国科学院计算技术研究所 First-view video behavior prediction system and method
CN112541440A (en) * 2020-12-16 2021-03-23 中电海康集团有限公司 Subway pedestrian flow network fusion method based on video pedestrian recognition and pedestrian flow prediction method
CN112668438A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Infrared video time sequence behavior positioning method, device, equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIDING YANG et al., "Distilling Knowledge from Graph Convolutional Networks", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5 August 2020 (2020-08-05) *
YU Haitao et al., "Adversarial Video Generation Method Based on Multimodal Input", Journal of Computer Research and Development, 31 July 2020 (2020-07-31) *
Statistical Machine Intelligence and Learning Laboratory, University of Electronic Science and Technology of China: "Introduction: What Is a Graph Convolutional Network? A Rising Star in Action Recognition", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/68690795> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142552A1 (en) * 2022-01-27 2023-08-03 苏州大学 Action prediction method for unknown category
CN114310917A (en) * 2022-03-11 2022-04-12 山东高原油气装备有限公司 Joint track error compensation method for oil pipe transfer robot
CN114310917B (en) * 2022-03-11 2022-06-14 山东高原油气装备有限公司 Oil pipe transfer robot joint track error compensation method

Similar Documents

Publication Publication Date Title
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN112633010A (en) Multi-head attention and graph convolution network-based aspect-level emotion analysis method and system
CN113762052A (en) Video cover extraction method, device, equipment and computer readable storage medium
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN113705402A (en) Video behavior prediction method, system, electronic device and storage medium
CN111737432A (en) Automatic dialogue method and system based on joint training model
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN114580794B (en) Data processing method, apparatus, program product, computer device and medium
CN116664719A (en) Image redrawing model training method, image redrawing method and device
CN113641797A (en) Data processing method, device, equipment, storage medium and computer program product
CN114647713A (en) Knowledge graph question-answering method, device and storage medium based on virtual confrontation
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN113987236B (en) Unsupervised training method and unsupervised training device for visual retrieval model based on graph convolution network
CN114398973B (en) Media content tag identification method, device, equipment and storage medium
CN117690098B (en) Multi-label identification method based on dynamic graph convolution under open driving scene
CN114818707A (en) Automatic driving decision method and system based on knowledge graph
CN115374927A (en) Neural network model training method, anomaly detection method and device
CN114627085A (en) Target image identification method and device, storage medium and electronic equipment
CN111935259B (en) Method and device for determining target account set, storage medium and electronic equipment
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium
CN111611981A (en) Information identification method and device and information identification neural network training method and device
CN116702784B (en) Entity linking method, entity linking device, computer equipment and storage medium
CN113283394B (en) Pedestrian re-identification method and system integrating context information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination