CN112183391A - First-view video behavior prediction system and method - Google Patents

First-view video behavior prediction system and method

Info

Publication number
CN112183391A
CN112183391A
Authority
CN
China
Prior art keywords
prediction
module
behavior
prediction result
intuition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011059356.4A
Other languages
Chinese (zh)
Inventor
蒋树强
张天予
闵巍庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202011059356.4A priority Critical patent/CN112183391A/en
Publication of CN112183391A publication Critical patent/CN112183391A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a first-view video behavior prediction system for performing behavior prediction from an existing video, comprising: a visual feature extraction module for extracting visual features from the existing video; an intuition-based prediction module for performing intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result; an analytical-reasoning-based prediction module for performing analytical-reasoning-based prediction according to the visual features extracted by the visual feature extraction module to obtain a second prediction result; and an adaptive fusion module for organically fusing the first prediction result and the second prediction result using an attention mechanism to obtain a final behavior prediction result. By organically combining intuition-based prediction with analytical-reasoning-based prediction, the invention effectively alleviates the "visual gap" problem, can predict a person's next action more directly from a psychological perspective, achieves higher accuracy, and provides more comprehensive support for practical engineering applications.

Description

First-view video behavior prediction system and method
Technical Field
The present invention relates to the field of video, in particular to the field of video behavior prediction, and more particularly to a first-view video behavior prediction system and method.
Background
Humans perceive and interact with the outside world from a self-centered perspective (i.e., the first-person view). With the development of smart wearable devices, recording video from the first-person view has found more and more application scenarios, such as virtual reality and human-computer interaction. Since the first-person view helps an intelligent system understand human intentions and goals, analyzing behavior in first-view video is important; this includes behavior recognition (recognizing behavior that has already been completed) and behavior prediction (predicting behavior that has not yet occurred). Although behavior recognition technology is relatively mature, relying on it alone is far from sufficient in practical applications; for example, a wearable power-assisted robot needs to infer the user's real intention in time and help the user take action, so as to provide more attentive services, which requires research on behavior prediction. Some research works [1] directly apply behavior recognition models to the behavior prediction task and achieve a certain prediction effect; some works [2] account for the uncertainty of future behavior, treat behavior prediction as a multi-label classification task and design a corresponding loss function, improving prediction accuracy; some works [3] subdivide behavior prediction into two parts, summarizing past information and predicting future information.
First-view behavior prediction requires predicting the behavior that may occur next from a video segment that has already been observed. Most existing methods rely on visual features extracted from the observed video data; because of the strong uncertainty of the future, there is often a large visual difference between the observed video segment and the unobservable, not-yet-occurred behavior to be predicted. We call this phenomenon the "visual gap". Most existing methods do not effectively use information beyond the visual modality (such as the text modality), making the "visual gap" problem difficult to alleviate. Furthermore, because first-view behavior prediction is directly related to understanding human intent, it should also be explored from the perspective of psychology or cognitive science [4]; the prior art does not consider this angle, so the predicted behavior deviates considerably from the actually occurring behavior and prediction accuracy is low.
The list of documents cited in the background section is as follows:
[1] Damen D, Doughty H, Maria Farinella G, et al. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision, 2018: 720-736.
[2] Antonino Furnari, Sebastiano Battiato, and Giovanni Maria Farinella. Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In Proceedings of the European Conference on Computer Vision, 2018: 389-405.
[3] Antonino Furnari and Giovanni Maria Farinella. What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention. In Proceedings of the IEEE International Conference on Computer Vision, 2019: 6251-6260.
[4] Robert Hamm. Clinical intuition and clinical analysis: expertise and the cognitive continuum. Professional Judgment: A Reader in Clinical Decision Making, 1988: 78-105.
Disclosure of Invention
Therefore, the present invention is directed to overcoming the above-mentioned drawbacks of the prior art and providing a new first-view video behavior prediction system and method.
The purpose of the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided a first view video behavior prediction system for performing behavior prediction based on an existing video, the system comprising: the visual feature extraction module is used for extracting visual features in the existing video; the intuition-based prediction module is used for carrying out intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result; the prediction module based on the analytical reasoning is used for carrying out the prediction based on the analytical reasoning according to the visual characteristics extracted by the visual characteristic extraction module to obtain a second prediction result; and the self-adaptive fusion module is used for organically fusing the first prediction result and the second prediction result by adopting an attention mechanism to obtain a final behavior prediction result.
Preferably, the visual feature extraction module comprises a convolutional neural network.
Preferably, the intuition-based prediction module includes: an encoder for generating, based on the visual features, intermediate variables corresponding to an abstract representation of past information (in some embodiments of the invention, the encoder is a long short-term memory (LSTM) encoder); a decoder for reading information from the intermediate variables generated by the encoder to output future information about future behavior (in some embodiments, the decoder is an LSTM decoder); and a classifier for mapping the future information output by the decoder to a behavior category to obtain the first prediction result (in some embodiments, the classifier is a fully-connected classifier).
Preferably, the analytical-reasoning-based prediction module includes: a recognition module for recognizing verbs and nouns in the visual features (in some embodiments of the invention, configured as a fully-connected classifier); a transition module for transferring the verb and the noun recognized by the recognition module to their next states according to conditional probabilities (in some embodiments, configured as Markov transition matrices); and a combination module for combining the next states of the verb and the noun transferred by the transition module into a new behavior to obtain the second prediction result (in some embodiments, using a prior-knowledge strategy).
Preferably, the adaptive fusion module includes a fully-connected network for analyzing the first prediction result and the second prediction result to obtain a normalized weighting coefficient for each of them; the adaptive fusion module multiplies the first prediction result and the second prediction result by their corresponding normalized weighting coefficients and sums the products to obtain the final behavior prediction result.
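To make the module composition concrete, the following is a minimal sketch of how the four modules could be wired together, assuming PyTorch; the class and argument names are illustrative assumptions, not the patent's reference implementation.

```python
# A minimal sketch of the overall system (assumed PyTorch); the
# sub-modules are defined in later sketches and all names are
# hypothetical, not taken from the patent.
import torch.nn as nn

class FirstViewPredictionSystem(nn.Module):
    def __init__(self, backbone, intuition, reasoning, fusion):
        super().__init__()
        self.backbone = backbone    # visual feature extraction module
        self.intuition = intuition  # intuition-based prediction module
        self.reasoning = reasoning  # analytical-reasoning-based prediction module
        self.fusion = fusion        # adaptive fusion module

    def forward(self, frames):            # frames: (B, T, C, H, W) observed video
        feats = self.backbone(frames)     # visual features x_t
        p_int = self.intuition(feats)     # first prediction result
        p_ana = self.reasoning(feats)     # second prediction result
        return self.fusion(p_int, p_ana)  # final behavior prediction result
```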
According to a second aspect of the present invention, there is provided a method for training the first-view video behavior prediction system according to the first aspect of the present invention, the method comprising the following three steps:
step 1) training an intuition-based prediction module, comprising:
inputting the visual features of the current moment into the intuition-based prediction module, outputting the behavior category of the next moment, and optimizing the intuition-based prediction module with a cross-entropy loss function, the loss function of the intuition-based prediction module being:

L_I = -y_{t+1} \log[G_{\theta_I}(x_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(x_t)]

where x_t is the visual feature at time t, \hat{y}^I_{t+1} = G_{\theta_I}(x_t) is the behavior category at time t+1 predicted by the intuition-based prediction module, \theta_I denotes all parameters of the intuition-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;

adopting a hidden-knowledge storage strategy based on text pre-training to store hidden knowledge for the intuition-based prediction module, comprising: inputting the word vector of the behavior preceding the next-moment behavior into the intuition-based prediction module, outputting a prediction result, and calculating the loss function

L_I' = -y_{t+1} \log[G_{\theta_I}(w_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(w_t)]

and then updating the parameters \theta_I of the intuition-based prediction module by gradient descent, where w_t is the word vector of y_t, the behavior preceding y_{t+1};
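As a hedged illustration of step 1, the sketch below runs the text pre-training pass and the visual pass for a hypothetical intuition module G; the binary cross-entropy form mirrors the loss written above, and all shapes, names and the optimizer are assumptions.

```python
# Illustrative training step for the intuition-based module (step 1).
# G maps a feature sequence to class probabilities; w_t is the word
# vector of the preceding behavior y_t (hypothetical tensors).
import torch.nn.functional as F

def intuition_step(G, optimizer, x_t, w_t, y_next):
    # Hidden-knowledge storage: pre-training pass on the text modality (L_I').
    loss_text = F.binary_cross_entropy(G(w_t), y_next)
    optimizer.zero_grad()
    loss_text.backward()
    optimizer.step()

    # Ordinary pass on the visual modality (L_I).
    loss_vis = F.binary_cross_entropy(G(x_t), y_next)
    optimizer.zero_grad()
    loss_vis.backward()
    optimizer.step()
    return loss_text.item(), loss_vis.item()
```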
step 2) training a prediction module based on analytical reasoning, which comprises the following steps:
taking the visual features as input to the analytical-reasoning-based prediction module, optimizing it with a cross-entropy loss function, and updating its parameters by gradient descent, the loss function of the analytical-reasoning-based prediction module being:

L_A = -y_{t+1} \log[A_{\theta_A}(x_t)] - (1 - y_{t+1}) \log[1 - A_{\theta_A}(x_t)]

where x_t is the visual feature at time t, \hat{y}^A_{t+1} = A_{\theta_A}(x_t) is the behavior category at time t+1 predicted by the analytical-reasoning-based prediction module, \theta_A denotes the parameters of the analytical-reasoning-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;
step 3) training the adaptive fusion module, which comprises:
taking the prediction result of the intuition-based prediction module and the prediction result of the analytical-reasoning-based prediction module as input, optimizing the adaptive fusion module with a cross-entropy loss function, and updating the parameters of the adaptive fusion module by gradient descent, the loss function of the adaptive fusion module being:

L = -y_{t+1} \log[F_{\theta}(x_t)] - (1 - y_{t+1}) \log[1 - F_{\theta}(x_t)]

where x_t is the visual feature at time t, F_{\theta}(x_t) is the behavior category at time t+1 predicted by the adaptive fusion module, \theta denotes the parameters of the adaptive fusion module, and y_{t+1} is the true label of the behavior category at time t+1;

F_{\theta}(x_t) = a_1 \hat{y}^I_{t+1} + a_2 \hat{y}^A_{t+1}

where a_1 is the fusion weight of the prediction result output by the intuition-based prediction module and a_2 is the fusion weight of the prediction result output by the analytical-reasoning-based prediction module;
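A corresponding sketch of the step-3 update, assuming the two modules' outputs are probability vectors and that `fusion` implements F_\theta; detaching the inputs so that only the fusion parameters are updated is an assumption about the training regime.

```python
# Illustrative training step for the adaptive fusion module (step 3).
import torch.nn.functional as F

def fusion_step(fusion, optimizer, p_int, p_ana, y_next):
    # p_int, p_ana: predictions of the intuition / analytical-reasoning modules.
    p_final = fusion(p_int.detach(), p_ana.detach())  # F_theta = a1*p_int + a2*p_ana
    loss = F.binary_cross_entropy(p_final, y_next)    # cross-entropy loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```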
wherein step 1) and step 2) have no fixed order and can run in parallel.
According to a third aspect of the present invention, there is provided a method for performing first-view video behavior prediction on the first-view video behavior prediction system according to the first aspect of the present invention, comprising:
S1, acquiring an observed first-view video and extracting visual features from the video;
S2, performing intuition-based behavior prediction on the visual features extracted in step S1 to obtain a first prediction result, and simultaneously performing analytical-reasoning-based behavior prediction on the visual features extracted in step S1 to obtain a second prediction result;
and S3, calculating fusion weights for the first prediction result and the second prediction result, and fusing the two results based on the fusion weights to obtain a final behavior prediction result.
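Strung together, S1-S3 might look like the following, reusing the hypothetical classes sketched in this document; `system` is an instance of the sketched system and the video-loading helper is assumed, not part of the disclosure.

```python
# Hypothetical end-to-end inference (S1-S3) with the sketched system.
import torch

with torch.no_grad():
    frames = load_observed_video("egocentric_clip.mp4")  # S1 (assumed helper)
    feats = system.backbone(frames)                      # S1: visual features
    p_int = system.intuition(feats)                      # S2: first prediction result
    p_ana = system.reasoning(feats)                      # S2: second prediction result
    p_final = system.fusion(p_int, p_ana)                # S3: weighted fusion
    action = p_final.argmax(dim=-1)                      # final behavior category
```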
Compared with the prior art, the invention has the advantages that:
the method organically combines the prediction based on intuition and the prediction based on analytical reasoning, effectively relieves the problem of 'vision gap', can more directly predict the next action of the human from the psychological angle, has higher accuracy, and provides more comprehensive support for the actual engineering application.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of a first-view video behavior prediction system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating the principle of a first-view video behavior prediction system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the operation of an intuition-based prediction module according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the operation of an analytical-reasoning-based prediction module according to an embodiment of the present invention;
Fig. 5 is a schematic diagram illustrating the operation of an adaptive fusion module according to an embodiment of the present invention;
Fig. 6 is a schematic diagram illustrating behavior prediction performed by a first-view video behavior prediction system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
During research, the inventors found that first-view behavior prediction is directly related to understanding human intent, yet the prior art rarely explores how to construct a behavior prediction model from the perspective of psychology or cognitive science. Through a study of related work in psychology, the inventors found that human cognition can be roughly divided into intuition and analytical reasoning: intuition is a subconscious, holistic thinking process that depends on the storage of implicit knowledge and is difficult to explain in language; analytical reasoning is conscious, proceeds in a definite order, and can be described in language. When handling complex problems involving prediction and decision making, human intuition and analytical reasoning each have their own advantages, and their complementarity can yield predictions of higher accuracy. The inventors therefore constructed a first-view behavior prediction model that integrates intuition and analytical reasoning, which can be divided into three parts: an intuition-based prediction module, an analytical-reasoning-based prediction module, and an adaptive fusion module.

In view of the abstract nature of intuition, the intuition-based prediction module models intuition as a black-box-like encoder-decoder structure, preferably two cascaded long short-term memory (LSTM) networks that encode past information and decode future information, respectively. On this basis text information is introduced, and a hidden-knowledge storage strategy based on text pre-training is proposed to address the "visual gap" problem: the module's parameters are pre-trained by replacing the input, namely the visual features of the observed video, with the word vectors of the observed behaviors. Because there is a time interval between the observable video and the behavior to be predicted, the visual information is discontinuous in the time dimension; text information avoids this problem, so richer and more structured implicit knowledge can be stored.

Considering that analytical reasoning is more interpretable and tends to process information according to given rules, the inventors divided the analytical-reasoning-based prediction into three steps: recognition, transition and combination. Since first-view behavior is usually expressed in the form of a (verb, noun) combination, and the number of verb or noun categories is often much smaller than the number of behavior categories, verb and noun information is first recognized from the input video; then the verb and the noun are each transferred to their next state through a Markov transition matrix; finally, since some (verb, noun) combinations are invalid, the verb and the noun are combined into the predicted behavior using prior knowledge (i.e., statistics of the co-occurrence of verbs and nouns). The transition process takes the probabilistic relation between past and future into account, which again helps alleviate the "visual gap" problem.
Considering the complementarity of intuition and analytical reasoning, the invention constructs an adaptive fusion module that introduces an attention mechanism to compute a weight for the intuition-based prediction result and a weight for the analytical-reasoning-based prediction result, and organically fuses the two to obtain the final prediction result.
According to an embodiment of the present invention, as shown in Fig. 1, the present invention provides a first-view video behavior prediction system, which broadly comprises: a visual feature extraction module for extracting visual features from an existing video; an intuition-based prediction module for performing intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result; an analytical-reasoning-based prediction module for performing analytical-reasoning-based prediction according to the visual features extracted by the visual feature extraction module to obtain a second prediction result; and an adaptive fusion module for organically fusing the first prediction result and the second prediction result using an attention mechanism to obtain a final behavior prediction result.
As shown in Fig. 2, when the prediction system of the invention is used for prediction, for an input video segment the visual feature extraction module first extracts visual features with a convolutional neural network; the system then predicts the subsequent, not-yet-occurred behavior based on intuition and on analytical reasoning; finally, the two partial prediction results are adaptively fused to obtain the final behavior prediction result. The core of the invention lies in three modules: the intuition-based prediction module, the analytical-reasoning-based prediction module, and the adaptive fusion module, which are described in detail below with reference to specific embodiments.
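Before turning to the three core modules, the feature extraction step can be sketched as follows; the patent only specifies "a convolutional neural network", so the choice of a torchvision ResNet-50 and the frame shapes below are assumptions.

```python
# One possible visual feature extractor (assumed; the patent does not
# fix a backbone): per-frame ResNet-50 features with the classifier removed.
import torch
import torchvision.models as models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # keep up to global pool
backbone.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)   # 8 observed frames (toy input)
    feats = backbone(frames).flatten(1)    # (8, 2048) frame-level features x_t
```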
First, the intuition-based prediction module
In view of the abstract nature of intuition, the invention models the intuition-based prediction module as a black-box-like process consisting of an encoder, a decoder and a classifier. As shown in Fig. 3, the encoder takes as input the visual features of the observed video (i.e., the past information) and generates an abstract representation of that information (i.e., the intermediate variables); the decoder reads the intermediate variables passed by the encoder and outputs information about future behavior (i.e., the future information); and the classifier maps the decoder output to a concrete behavior category to obtain a concrete prediction result. For ease of understanding, the result predicted by the intuition-based prediction module is called the first prediction result. In psychology, intuition relies on the storage of implicit knowledge (knowledge that is difficult to explain in language); in the intuition-based prediction module, the storage of implicit knowledge corresponds to the update and optimization of the parameters. Because long short-term memory (LSTM) networks are good at storing temporal information, the invention implements the encoder and the decoder as two cascaded LSTM networks and the classifier as a fully-connected network.

According to one embodiment of the invention, the training process of this module is as follows: input the visual feature x_t at time t and output the behavior category at time t+1,

\hat{y}^I_{t+1} = G_{\theta_I}(x_t)

where \theta_I denotes all parameters of the LSTM encoder, the LSTM decoder and the fully-connected classifier. The module is optimized with a cross-entropy function, the loss being:

L_I = -y_{t+1} \log[G_{\theta_I}(x_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(x_t)]

where y_{t+1} is the true label of the behavior category at time t+1. The optimization objective is to minimize L_I, whose effect is to bring the intuitive prediction \hat{y}^I_{t+1} as close as possible to the true label y_{t+1}. In particular, visual features alone are not enough to store rich implicit knowledge; as shown in Fig. 3, the invention therefore adopts a hidden-knowledge storage strategy based on text pre-training: with everything else unchanged, the visual feature x_t is replaced by w_t, the word vector of y_t (the behavior preceding y_{t+1}). After the LSTM encoder, the LSTM decoder and the fully-connected classifier act on it, a prediction \hat{y}^I_{t+1} = G_{\theta_I}(w_t) is obtained, from which the loss

L_I' = -y_{t+1} \log[G_{\theta_I}(w_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(w_t)]

is calculated, and the parameters \theta_I are updated with the classical gradient descent algorithm of machine learning, realizing the parameter training of the intuition module. This strategy introduces the text modality on top of the visual modality; by using multi-modal information, richer implicit knowledge can be stored.
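A minimal sketch of this encoder-decoder-classifier structure, assuming PyTorch; the hidden size, the single-step decoding scheme and the softmax output are assumptions made for illustration.

```python
# Sketch of the intuition-based module: two cascaded LSTMs plus a
# fully-connected classifier (dimensions are hypothetical).
import torch
import torch.nn as nn

class IntuitionModule(nn.Module):
    def __init__(self, feat_dim, num_classes, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats):                # feats: (B, T, feat_dim)
        _, state = self.encoder(feats)       # encode past info -> intermediate variables
        h_last = state[0][-1].unsqueeze(1)   # (B, 1, hidden) seed for the decoder
        out, _ = self.decoder(h_last, state) # decode one future step
        return torch.softmax(self.classifier(out[:, -1]), dim=-1)  # \hat{y}^I_{t+1}
```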
Second, the analytical-reasoning-based prediction module
Considering that analytical reasoning is more interpretable and tends to process information according to given rules, the invention divides the analytical-reasoning-based prediction process into three steps: recognition, transition and combination. Still referring to Fig. 2: since first-view behavior is usually expressed as a (verb, noun) combination, the invention first recognizes a verb and a noun, e.g. "open" and "cupboard", from the input visual features; then transfers the verb and the noun to their next states according to conditional probabilities, e.g. the verb transitions from "open" to "take" with the highest probability and the noun transitions from "cupboard" to "cup" with the highest probability; and finally combines nouns and verbs into a new behavior while avoiding unreasonable combinations as far as possible. For example, ("take", "cupboard") and ("open", "cup") are unreasonable combinations while ("take", "cup") is reasonable, so prior knowledge can be used to raise the probability of the "take cup" combination and lower the probability of the others. As shown in Fig. 4, the training process of this module can be summarized as follows: input the visual feature x_t at time t; recognize the verb information v_t and the noun information n_t at time t with fully-connected networks; transfer v_t and n_t to time t+1 through the Markov transition matrices T_v and T_n,

v_{t+1} = T_v^T v_t, n_{t+1} = T_n^T n_t

and finally combine v_{t+1} and n_{t+1} into a new behavior using prior knowledge,

a_{t+1} = softmax[\eta(v, a) v_{t+1} + \eta(n, a) n_{t+1}]

where \eta(v, a) and \eta(n, a) denote the prior probability of the predicted action a given the verb v or the noun n, respectively. In the analytical-reasoning-based prediction module, the Markov transition matrices and the prior knowledge used in the combination step are obtained by statistics and contain no trainable parameters, so the parameters \theta_A of the fully-connected networks are the parameters of the module that need to be trained. According to one embodiment of the invention, as with the intuition-based prediction module, the analytical-reasoning-based prediction module is optimized with a cross-entropy function, the loss being:

L_A = -y_{t+1} \log[A_{\theta_A}(x_t)] - (1 - y_{t+1}) \log[1 - A_{\theta_A}(x_t)]

For the input visual feature x_t, after the fully-connected classifier, the Markov transition matrices and the prior knowledge are applied, a prediction \hat{y}^A_{t+1} = A_{\theta_A}(x_t) is obtained, from which the loss L_A is calculated, and the parameters \theta_A are updated with the classical gradient descent algorithm of machine learning, realizing the parameter training of the analytical reasoning module.
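The recognition/transition/combination pipeline can be sketched as follows, assuming PyTorch; in this sketch the transition matrices T_v, T_n and the priors \eta would be estimated by counting over the training annotations and are taken as given.

```python
# Sketch of the analytical-reasoning-based module. Only the two
# recognition layers are trainable; T_v, T_n and the priors are
# statistics registered as buffers (all shapes hypothetical).
import torch
import torch.nn as nn

class ReasoningModule(nn.Module):
    def __init__(self, feat_dim, n_verbs, n_nouns, n_actions,
                 T_v, T_n, eta_v, eta_n):
        super().__init__()
        self.verb_fc = nn.Linear(feat_dim, n_verbs)  # recognition: verbs
        self.noun_fc = nn.Linear(feat_dim, n_nouns)  # recognition: nouns
        self.register_buffer("T_v", T_v)      # (n_verbs, n_verbs) Markov matrix
        self.register_buffer("T_n", T_n)      # (n_nouns, n_nouns) Markov matrix
        self.register_buffer("eta_v", eta_v)  # (n_verbs, n_actions) prior eta(v, a)
        self.register_buffer("eta_n", eta_n)  # (n_nouns, n_actions) prior eta(n, a)

    def forward(self, feats):                 # feats: (B, feat_dim) pooled features
        v_t = torch.softmax(self.verb_fc(feats), dim=-1)
        n_t = torch.softmax(self.noun_fc(feats), dim=-1)
        v_next = v_t @ self.T_v               # v_{t+1} = T_v^T v_t, batched
        n_next = n_t @ self.T_n               # n_{t+1} = T_n^T n_t, batched
        logits = v_next @ self.eta_v + n_next @ self.eta_n  # prior combination
        return torch.softmax(logits, dim=-1)  # \hat{y}^A_{t+1} = a_{t+1}
```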
Third, the adaptive fusion module
Considering the complementarity of intuition and analytical reasoning, the invention designs an adaptive fusion module based on an attention mechanism, whose purpose is to organically fuse the intuition-based prediction result \hat{y}^I_{t+1} and the analytical-reasoning-based prediction result \hat{y}^A_{t+1}. As shown in Fig. 5, first \hat{y}^I_{t+1} and \hat{y}^A_{t+1} are each fed into a fully-connected network with trainable parameters (this network is part of the adaptive fusion module and is used to calculate the fusion weights), yielding the attention scores s_1 and s_2 corresponding to the intuition-based and the analytical-reasoning-based prediction result, respectively. Then s_1 and s_2 are normalized to obtain the weighting coefficients a_1 and a_2 (i.e., the fusion weights of the intuition-based and the analytical-reasoning-based prediction results), calculated as

[a_1, a_2] = softmax([s_1, s_2])

Finally, the two prediction results are multiplied by their corresponding weighting coefficients and summed to obtain the final prediction result

F_\theta(x_t) = a_1 \hat{y}^I_{t+1} + a_2 \hat{y}^A_{t+1}

where the parameters \theta of the fully-connected network are the parameters of the adaptive fusion module that need to be trained. As before, the adaptive fusion module is optimized with a cross-entropy function, the loss being:

L = -y_{t+1} \log[F_\theta(x_t)] - (1 - y_{t+1}) \log[1 - F_\theta(x_t)]

The inputs of the adaptive fusion module are the intuition-based prediction result \hat{y}^I_{t+1} and the analytical-reasoning-based prediction result \hat{y}^A_{t+1}; each passes through the same fully-connected network to obtain its fusion weight, a_1 and a_2 respectively, giving the final prediction result F_\theta(x_t); the loss L is calculated and the parameter \theta is updated with the classical gradient descent algorithm of machine learning, realizing the parameter training of the adaptive fusion module. During training, the fusion weights a_1 and a_2 are adjusted dynamically (through optimization of the cross-entropy loss: each time new \hat{y}^I_{t+1} and \hat{y}^A_{t+1} are fed into the fully-connected network, the parameter \theta is updated, and different parameters \theta produce different a_1 and a_2), so the whole framework can adaptively assign different fusion weights to intuition or to analytical reasoning, organically fusing the two kinds of prediction results and further improving prediction accuracy.
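A minimal sketch of this fusion step, assuming PyTorch; using a single shared linear scorer for both inputs is one reading of "the same fully-connected network" and is flagged here as an assumption.

```python
# Sketch of the attention-based adaptive fusion module: a shared FC
# scorer produces s_1, s_2; softmax gives the fusion weights a_1, a_2.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.scorer = nn.Linear(num_classes, 1)  # shared fully-connected scorer

    def forward(self, p_int, p_ana):             # (B, num_classes) each
        s = torch.cat([self.scorer(p_int), self.scorer(p_ana)], dim=1)  # (B, 2)
        a = torch.softmax(s, dim=1)              # [a_1, a_2] = softmax([s_1, s_2])
        return a[:, :1] * p_int + a[:, 1:] * p_ana  # F_theta(x_t)
```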
For a better understanding of the present invention, the invention is described below with reference to an example.
According to an example of the invention, as shown in Fig. 6, features are first extracted by a convolutional neural network; the video segment known to have occurred contains the information "opened the cupboard". The intuition-based prediction module, pre-trained on text with the information of the opened cupboard and of the next behavior, outputs a first prediction result such as "take cup, open cupboard, take bowl". The analytical-reasoning-based prediction module recognizes the verb information "open" and the noun information "cupboard", and transfers them based on prior probabilities: the next possible states of "open" are "take, close, put down", etc., and the next possible states of "cupboard" are "cup, cupboard, bowl", etc.; the next states of the verb and the noun are then combined based on prior probabilities, yielding a second prediction result such as "take cup, close cupboard, put down cup". The first and second prediction results are taken as input to the adaptive fusion module, whose fully-connected network calculates their respective weights; adaptive fusion based on these weights produces possible final predictions such as "take cup, take bowl, put down cup, close cupboard, put down bowl", each with a different probability depending on the fusion weights, and the one with the maximum probability is taken as the final behavior prediction result.
By organically combining intuition-based prediction with analytical-reasoning-based prediction, the invention effectively alleviates the "visual gap" problem, can predict a person's next action more directly from a psychological perspective, achieves higher accuracy, and provides more comprehensive support for practical engineering applications.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A first-view video behavior prediction system for performing behavior prediction based on an existing video, the system comprising:
the visual feature extraction module is used for extracting visual features in the existing video;
the intuition-based prediction module is used for performing intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result;
the prediction module based on the analytical reasoning is used for carrying out the prediction based on the analytical reasoning according to the visual characteristics extracted by the visual characteristic extraction module to obtain a second prediction result;
and the self-adaptive fusion module is used for organically fusing the first prediction result and the second prediction result by adopting an attention mechanism to obtain a final behavior prediction result.
2. The system according to claim 1, wherein the visual feature extraction module comprises a convolutional neural network.
3. The system according to claim 1, wherein the intuition-based prediction module comprises:
an encoder for generating an intermediate variable corresponding to an abstract representation of past information based on visual characteristics;
the decoder is used for reading information from the intermediate variables generated by the encoder to output future information about future behavior;
and the classifier is used for mapping the future information output by the decoder to a behavior class to obtain a first prediction result.
4. The system according to claim 3,
the encoder is a long short-term memory (LSTM) encoder, the decoder is an LSTM decoder, and the classifier is a fully-connected classifier.
5. The first-view video behavior prediction system of claim 1, wherein the analytical-reasoning-based prediction module comprises:
the recognition module is used for recognizing verbs and nouns in the visual features;
the transition module is used for transferring the verbs and nouns in the visual features recognized by the recognition module to their next states according to conditional probabilities;
and the combination module is used for combining the next states of the verb and the noun transferred by the transition module into a new behavior to obtain a second prediction result.
6. The first-view video behavior prediction system according to claim 5, wherein:
the identification module is configured as a fully connected classifier;
the transition module is configured as a Markov transition matrix;
the combining module employs a priori knowledge strategy.
7. The first-view video behavior prediction system according to claim 1, wherein the adaptive fusion module comprises: a fully-connected network for analyzing the first prediction result and the second prediction result to obtain a normalized weighting coefficient for the first prediction result and a normalized weighting coefficient for the second prediction result, respectively;
and the adaptive fusion module sums the products of the first prediction result and the second prediction result with their corresponding normalized weighting coefficients to obtain a final behavior prediction result.
8. A method for training the first-view video behavior prediction system according to any of claims 1-7, the method comprising the following three steps:
step 1) training an intuition-based prediction module, comprising:
inputting the visual features of the current moment into the intuition-based prediction module, outputting the behavior category of the next moment, and optimizing the intuition-based prediction module with a cross-entropy loss function, the loss function of the intuition-based prediction module being:

L_I = -y_{t+1} \log[G_{\theta_I}(x_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(x_t)]

where x_t is the visual feature at time t, \hat{y}^I_{t+1} = G_{\theta_I}(x_t) is the behavior category at time t+1 predicted by the intuition-based prediction module, \theta_I denotes all parameters of the intuition-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;

adopting a hidden-knowledge storage strategy based on text pre-training to store hidden knowledge for the intuition-based prediction module, comprising: inputting the word vector of the behavior preceding the next-moment behavior into the intuition-based prediction module, outputting a prediction result, and calculating the loss function

L_I' = -y_{t+1} \log[G_{\theta_I}(w_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(w_t)]

and then updating the parameters \theta_I of the intuition-based prediction module by gradient descent, where w_t is the word vector of y_t, the behavior preceding y_{t+1};
step 2) training a prediction module based on analytical reasoning, which comprises the following steps:
taking the visual features as input to the analytical-reasoning-based prediction module, optimizing it with a cross-entropy loss function, and updating its parameters by gradient descent, the loss function of the analytical-reasoning-based prediction module being:

L_A = -y_{t+1} \log[A_{\theta_A}(x_t)] - (1 - y_{t+1}) \log[1 - A_{\theta_A}(x_t)]

where x_t is the visual feature at time t, \hat{y}^A_{t+1} = A_{\theta_A}(x_t) is the behavior category at time t+1 predicted by the analytical-reasoning-based prediction module, \theta_A denotes the parameters of the analytical-reasoning-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;
step 3) training the adaptive fusion module, which comprises:
taking the prediction result of the intuition-based prediction module and the prediction result of the analytical-reasoning-based prediction module as input, optimizing the adaptive fusion module with a cross-entropy loss function, and updating the parameters of the adaptive fusion module by gradient descent, the loss function of the adaptive fusion module being:

L = -y_{t+1} \log[F_{\theta}(x_t)] - (1 - y_{t+1}) \log[1 - F_{\theta}(x_t)]

where x_t is the visual feature at time t, F_{\theta}(x_t) is the behavior category at time t+1 predicted by the adaptive fusion module, \theta denotes the parameters of the adaptive fusion module, and y_{t+1} is the true label of the behavior category at time t+1;

F_{\theta}(x_t) = a_1 \hat{y}^I_{t+1} + a_2 \hat{y}^A_{t+1}

where a_1 is the fusion weight of the prediction result output by the intuition-based prediction module and a_2 is the fusion weight of the prediction result output by the analytical-reasoning-based prediction module;
wherein step 1) and step 2) have no fixed order.
9. A method for performing first-view video behavior prediction using the first-view video behavior prediction system of any of claims 1-7, comprising:
S1, acquiring an observed first-view video and extracting visual features from the video;
S2, performing intuition-based behavior prediction on the visual features extracted in step S1 to obtain a first prediction result, and simultaneously performing analytical-reasoning-based behavior prediction on the visual features extracted in step S1 to obtain a second prediction result;
and S3, calculating fusion weights for the first prediction result and the second prediction result, and fusing the two results based on the fusion weights to obtain a final behavior prediction result.
10. A computer-readable storage medium, having embodied thereon a computer program, the computer program being executable by a processor to perform the steps of the method of claim 9.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the steps of the method as claimed in claim 9.
CN202011059356.4A 2020-09-30 2020-09-30 First-view video behavior prediction system and method Pending CN112183391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011059356.4A CN112183391A (en) 2020-09-30 2020-09-30 First-view video behavior prediction system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011059356.4A CN112183391A (en) 2020-09-30 2020-09-30 First-view video behavior prediction system and method

Publications (1)

Publication Number Publication Date
CN112183391A 2021-01-05

Family

ID=73947082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011059356.4A Pending CN112183391A (en) 2020-09-30 2020-09-30 First-view video behavior prediction system and method

Country Status (1)

Country Link
CN (1) CN112183391A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170289624A1 (en) * 2016-04-01 2017-10-05 Samsung Electrônica da Amazônia Ltda. Multimodal and real-time method for filtering sensitive media
CN110574046A (en) * 2017-05-19 2019-12-13 渊慧科技有限公司 Data efficient emulation of various behaviors
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN110991290A (en) * 2019-11-26 2020-04-10 西安电子科技大学 Video description method based on semantic guidance and memory mechanism
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘昊淼: "Visual Representation Learning for Object Semantic Understanding", Wanfang Data *
刘毅志 et al.: "*** Video Detection Fusing Audio Words and Visual Features", Journal of Image and Graphics *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327268A (en) * 2021-05-26 2021-08-31 中国科学院计算技术研究所 Self-constrained video activity prediction method and system
CN113705402A (en) * 2021-08-18 2021-11-26 中国科学院自动化研究所 Video behavior prediction method, system, electronic device and storage medium

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 20210105