CN112183391A - First-view video behavior prediction system and method - Google Patents

First-view video behavior prediction system and method

Info

Publication number
CN112183391A
CN112183391A
Authority
CN
China
Prior art keywords
prediction
module
behavior
prediction result
intuition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011059356.4A
Other languages
Chinese (zh)
Inventor
蒋树强
张天予
闵巍庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202011059356.4A priority Critical patent/CN112183391A/en
Publication of CN112183391A publication Critical patent/CN112183391A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a first-view video behavior prediction system for performing behavior prediction from an existing video, comprising: a visual feature extraction module for extracting visual features from the existing video; an intuition-based prediction module for performing intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result; an analytical-reasoning-based prediction module for performing analytical-reasoning-based prediction according to the visual features extracted by the visual feature extraction module to obtain a second prediction result; and an adaptive fusion module for organically fusing the first prediction result and the second prediction result using an attention mechanism to obtain a final behavior prediction result. By organically combining intuition-based prediction with analytical-reasoning-based prediction, the invention effectively alleviates the "visual gap" problem, can predict a person's next action more directly from a psychological perspective, achieves higher accuracy, and provides more comprehensive support for practical engineering applications.

Description

First-view video behavior prediction system and method
Technical Field
The present invention relates to the field of video, in particular to the field of video behavior prediction, and more particularly to a first-view video behavior prediction system and method.
Background
Humans perceive and interact with the outside world from a self-centered perspective (i.e., the first-person view). With the development of smart wearable devices, recording video from the first-person view has found more and more application scenarios, such as virtual reality and human-computer interaction. Since the first-person view helps an intelligent system understand human intentions and goals, analyzing behavior in first-view video is important; this includes behavior recognition (recognizing behavior that has already been completed) and behavior prediction (predicting behavior that has not yet occurred). Although behavior recognition technology is relatively mature, relying on it alone is far from sufficient in practical applications; for example, a wearable power-assisted robot needs to infer the user's real intention in time and help the user take action, so as to provide more attentive services, which requires research on behavior prediction. Some research works [1] directly apply behavior recognition models to the behavior prediction task and achieve a certain prediction effect; some works [2] account for the uncertainty of future behavior, treat behavior prediction as a multi-label classification task and design a corresponding loss function, improving prediction accuracy; some works [3] subdivide behavior prediction into two parts, summarizing past information and predicting future information.
First-view behavior prediction requires predicting the behavior that may occur next from a video segment that has already been observed. Most existing methods rely on visual features extracted from the observed video data; because of the strong uncertainty of the future, there is often a large visual difference between the observed video segment and the unobservable, not-yet-occurred behavior to be predicted. We call this phenomenon the "visual gap". Most existing methods do not effectively use information beyond the visual modality (such as the text modality), making the "visual gap" problem difficult to alleviate. Furthermore, because first-view behavior prediction is directly related to understanding human intent, it should also be explored from the perspective of psychology or cognitive science [4]; the prior art does not consider this angle, so the predicted behavior deviates considerably from the actually occurring behavior and prediction accuracy is low.
The list of documents cited in the background section is as follows:
[1] Damen D, Doughty H, Maria Farinella G, et al. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision, 2018: 720-736.
[2] Antonino Furnari, Sebastiano Battiato, and Giovanni Maria Farinella. Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In Proceedings of the European Conference on Computer Vision, 2018: 389-405.
[3] Antonino Furnari and Giovanni Maria Farinella. What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention. In Proceedings of the IEEE International Conference on Computer Vision, 2019: 6251-6260.
[4] Robert Hamm. Clinical intuition and clinical analysis: expertise and the cognitive continuum. Professional Judgment: A Reader in Clinical Decision Making, 1988: 78-105.
Disclosure of Invention
Therefore, the present invention is directed to overcoming the above-mentioned drawbacks of the prior art and providing a new first-view video behavior prediction system and method.
The purpose of the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided a first view video behavior prediction system for performing behavior prediction based on an existing video, the system comprising: the visual feature extraction module is used for extracting visual features in the existing video; the intuition-based prediction module is used for carrying out intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result; the prediction module based on the analytical reasoning is used for carrying out the prediction based on the analytical reasoning according to the visual characteristics extracted by the visual characteristic extraction module to obtain a second prediction result; and the self-adaptive fusion module is used for organically fusing the first prediction result and the second prediction result by adopting an attention mechanism to obtain a final behavior prediction result.
Preferably, the visual feature extraction module comprises a convolutional neural network.
Preferably, the intuition-based prediction module includes: an encoder for generating, based on the visual features, intermediate variables corresponding to an abstract representation of past information (in some embodiments of the invention, the encoder is a long short-term memory (LSTM) encoder); a decoder for reading information from the intermediate variables generated by the encoder to output future information about future behavior (in some embodiments, the decoder is an LSTM decoder); and a classifier for mapping the future information output by the decoder to a behavior category to obtain the first prediction result (in some embodiments, the classifier is a fully-connected classifier).
Preferably, the analytical-reasoning-based prediction module includes: a recognition module for recognizing verbs and nouns in the visual features (in some embodiments of the invention, configured as a fully-connected classifier); a transition module for transferring the verb and the noun recognized by the recognition module to their next states according to conditional probabilities (in some embodiments, configured as Markov transition matrices); and a combination module for combining the next states of the verb and the noun transferred by the transition module into a new behavior to obtain the second prediction result (in some embodiments, using a prior-knowledge strategy).
Preferably, the adaptive fusion module includes a fully-connected network for analyzing the first prediction result and the second prediction result to obtain a normalized weighting coefficient for each of them; the adaptive fusion module multiplies the first prediction result and the second prediction result by their corresponding normalized weighting coefficients and sums the products to obtain the final behavior prediction result.
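To make the module composition concrete, the following is a minimal sketch of how the four modules could be wired together, assuming PyTorch; the class and argument names are illustrative assumptions, not the patent's reference implementation.

```python
# A minimal sketch of the overall system (assumed PyTorch); the
# sub-modules are defined in later sketches and all names are
# hypothetical, not taken from the patent.
import torch.nn as nn

class FirstViewPredictionSystem(nn.Module):
    def __init__(self, backbone, intuition, reasoning, fusion):
        super().__init__()
        self.backbone = backbone    # visual feature extraction module
        self.intuition = intuition  # intuition-based prediction module
        self.reasoning = reasoning  # analytical-reasoning-based prediction module
        self.fusion = fusion        # adaptive fusion module

    def forward(self, frames):            # frames: (B, T, C, H, W) observed video
        feats = self.backbone(frames)     # visual features x_t
        p_int = self.intuition(feats)     # first prediction result
        p_ana = self.reasoning(feats)     # second prediction result
        return self.fusion(p_int, p_ana)  # final behavior prediction result
```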
According to a second aspect of the present invention, there is provided a method for training the first-view video behavior prediction system according to the first aspect of the present invention, the method comprising the following three steps:
step 1) training an intuition-based prediction module, comprising:
inputting the visual features of the current moment into the intuition-based prediction module, outputting the behavior category of the next moment, and optimizing the intuition-based prediction module with a cross-entropy loss function, the loss function of the intuition-based prediction module being:

L_I = -y_{t+1} \log[G_{\theta_I}(x_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(x_t)]

where x_t is the visual feature at time t, \hat{y}^I_{t+1} = G_{\theta_I}(x_t) is the behavior category at time t+1 predicted by the intuition-based prediction module, \theta_I denotes all parameters of the intuition-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;

adopting a hidden-knowledge storage strategy based on text pre-training to store hidden knowledge for the intuition-based prediction module, comprising: inputting the word vector of the behavior preceding the next-moment behavior into the intuition-based prediction module, outputting a prediction result, and calculating the loss function

L_I' = -y_{t+1} \log[G_{\theta_I}(w_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(w_t)]

and then updating the parameters \theta_I of the intuition-based prediction module by gradient descent, where w_t is the word vector of y_t, the behavior preceding y_{t+1};
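As a hedged illustration of step 1, the sketch below runs the text pre-training pass and the visual pass for a hypothetical intuition module G; the binary cross-entropy form mirrors the loss written above, and all shapes, names and the optimizer are assumptions.

```python
# Illustrative training step for the intuition-based module (step 1).
# G maps a feature sequence to class probabilities; w_t is the word
# vector of the preceding behavior y_t (hypothetical tensors).
import torch.nn.functional as F

def intuition_step(G, optimizer, x_t, w_t, y_next):
    # Hidden-knowledge storage: pre-training pass on the text modality (L_I').
    loss_text = F.binary_cross_entropy(G(w_t), y_next)
    optimizer.zero_grad()
    loss_text.backward()
    optimizer.step()

    # Ordinary pass on the visual modality (L_I).
    loss_vis = F.binary_cross_entropy(G(x_t), y_next)
    optimizer.zero_grad()
    loss_vis.backward()
    optimizer.step()
    return loss_text.item(), loss_vis.item()
```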
step 2) training a prediction module based on analytical reasoning, which comprises the following steps:
taking the visual features as input to the analytical-reasoning-based prediction module, optimizing it with a cross-entropy loss function, and updating its parameters by gradient descent, the loss function of the analytical-reasoning-based prediction module being:

L_A = -y_{t+1} \log[A_{\theta_A}(x_t)] - (1 - y_{t+1}) \log[1 - A_{\theta_A}(x_t)]

where x_t is the visual feature at time t, \hat{y}^A_{t+1} = A_{\theta_A}(x_t) is the behavior category at time t+1 predicted by the analytical-reasoning-based prediction module, \theta_A denotes the parameters of the analytical-reasoning-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;
step 3) training the adaptive fusion module, which comprises:
taking the prediction result of the intuition-based prediction module and the prediction result of the analytical-reasoning-based prediction module as input, optimizing the adaptive fusion module with a cross-entropy loss function, and updating the parameters of the adaptive fusion module by gradient descent, the loss function of the adaptive fusion module being:

L = -y_{t+1} \log[F_{\theta}(x_t)] - (1 - y_{t+1}) \log[1 - F_{\theta}(x_t)]

where x_t is the visual feature at time t, F_{\theta}(x_t) is the behavior category at time t+1 predicted by the adaptive fusion module, \theta denotes the parameters of the adaptive fusion module, and y_{t+1} is the true label of the behavior category at time t+1;

F_{\theta}(x_t) = a_1 \hat{y}^I_{t+1} + a_2 \hat{y}^A_{t+1}

where a_1 is the fusion weight of the prediction result output by the intuition-based prediction module and a_2 is the fusion weight of the prediction result output by the analytical-reasoning-based prediction module;
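A corresponding sketch of the step-3 update, assuming the two modules' outputs are probability vectors and that `fusion` implements F_\theta; detaching the inputs so that only the fusion parameters are updated is an assumption about the training regime.

```python
# Illustrative training step for the adaptive fusion module (step 3).
import torch.nn.functional as F

def fusion_step(fusion, optimizer, p_int, p_ana, y_next):
    # p_int, p_ana: predictions of the intuition / analytical-reasoning modules.
    p_final = fusion(p_int.detach(), p_ana.detach())  # F_theta = a1*p_int + a2*p_ana
    loss = F.binary_cross_entropy(p_final, y_next)    # cross-entropy loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```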
wherein step 1) and step 2) have no fixed order and can run in parallel.
According to a third aspect of the present invention, there is provided a method for performing first-view video behavior prediction on the first-view video behavior prediction system according to the first aspect of the present invention, comprising:
S1, acquiring an observed first-view video and extracting visual features from the video;
S2, performing intuition-based behavior prediction on the visual features extracted in step S1 to obtain a first prediction result, and simultaneously performing analytical-reasoning-based behavior prediction on the visual features extracted in step S1 to obtain a second prediction result;
and S3, calculating fusion weights for the first prediction result and the second prediction result, and fusing the two results based on the fusion weights to obtain a final behavior prediction result.
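Strung together, S1-S3 might look like the following, reusing the hypothetical classes sketched in this document; `system` is an instance of the sketched system and the video-loading helper is assumed, not part of the disclosure.

```python
# Hypothetical end-to-end inference (S1-S3) with the sketched system.
import torch

with torch.no_grad():
    frames = load_observed_video("egocentric_clip.mp4")  # S1 (assumed helper)
    feats = system.backbone(frames)                      # S1: visual features
    p_int = system.intuition(feats)                      # S2: first prediction result
    p_ana = system.reasoning(feats)                      # S2: second prediction result
    p_final = system.fusion(p_int, p_ana)                # S3: weighted fusion
    action = p_final.argmax(dim=-1)                      # final behavior category
```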
Compared with the prior art, the invention has the advantages that:
the method organically combines the prediction based on intuition and the prediction based on analytical reasoning, effectively relieves the problem of 'vision gap', can more directly predict the next action of the human from the psychological angle, has higher accuracy, and provides more comprehensive support for the actual engineering application.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of a first-view video behavior prediction system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating the principle of a first-view video behavior prediction system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the operation of an intuition-based prediction module according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the operation of an analytical-reasoning-based prediction module according to an embodiment of the present invention;
Fig. 5 is a schematic diagram illustrating the operation of an adaptive fusion module according to an embodiment of the present invention;
Fig. 6 is a schematic diagram illustrating behavior prediction performed by a first-view video behavior prediction system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
During research, the inventors found that first-view behavior prediction is directly related to understanding human intent, yet the prior art rarely explores how to construct a behavior prediction model from the perspective of psychology or cognitive science. Through a study of related work in psychology, the inventors found that human cognition can be roughly divided into intuition and analytical reasoning: intuition is a subconscious, holistic thinking process that depends on the storage of implicit knowledge and is difficult to explain in language; analytical reasoning is conscious, proceeds in a definite order, and can be described in language. When handling complex problems involving prediction and decision making, human intuition and analytical reasoning each have their own advantages, and their complementarity can yield predictions of higher accuracy. The inventors therefore constructed a first-view behavior prediction model that integrates intuition and analytical reasoning, which can be divided into three parts: an intuition-based prediction module, an analytical-reasoning-based prediction module, and an adaptive fusion module.

In view of the abstract nature of intuition, the intuition-based prediction module models intuition as a black-box-like encoder-decoder structure, preferably two cascaded long short-term memory (LSTM) networks that encode past information and decode future information, respectively. On this basis text information is introduced, and a hidden-knowledge storage strategy based on text pre-training is proposed to address the "visual gap" problem: the module's parameters are pre-trained by replacing the input, namely the visual features of the observed video, with the word vectors of the observed behaviors. Because there is a time interval between the observable video and the behavior to be predicted, the visual information is discontinuous in the time dimension; text information avoids this problem, so richer and more structured implicit knowledge can be stored.

Considering that analytical reasoning is more interpretable and tends to process information according to given rules, the inventors divided the analytical-reasoning-based prediction into three steps: recognition, transition and combination. Since first-view behavior is usually expressed in the form of a (verb, noun) combination, and the number of verb or noun categories is often much smaller than the number of behavior categories, verb and noun information is first recognized from the input video; then the verb and the noun are each transferred to their next state through a Markov transition matrix; finally, since some (verb, noun) combinations are invalid, the verb and the noun are combined into the predicted behavior using prior knowledge (i.e., statistics of the co-occurrence of verbs and nouns). The transition process takes the probabilistic relation between past and future into account, which again helps alleviate the "visual gap" problem.
Considering the complementarity of intuition and analytical reasoning, the invention constructs an adaptive fusion module that introduces an attention mechanism to compute a weight for the intuition-based prediction result and a weight for the analytical-reasoning-based prediction result, and organically fuses the two to obtain the final prediction result.
According to an embodiment of the present invention, as shown in Fig. 1, the present invention provides a first-view video behavior prediction system, which broadly comprises: a visual feature extraction module for extracting visual features from an existing video; an intuition-based prediction module for performing intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result; an analytical-reasoning-based prediction module for performing analytical-reasoning-based prediction according to the visual features extracted by the visual feature extraction module to obtain a second prediction result; and an adaptive fusion module for organically fusing the first prediction result and the second prediction result using an attention mechanism to obtain a final behavior prediction result.
As shown in Fig. 2, when the prediction system of the invention is used for prediction, for an input video segment the visual feature extraction module first extracts visual features with a convolutional neural network; the system then predicts the subsequent, not-yet-occurred behavior based on intuition and on analytical reasoning; finally, the two partial prediction results are adaptively fused to obtain the final behavior prediction result. The core of the invention lies in three modules: the intuition-based prediction module, the analytical-reasoning-based prediction module, and the adaptive fusion module, which are described in detail below with reference to specific embodiments.
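Before turning to the three core modules, the feature extraction step can be sketched as follows; the patent only specifies "a convolutional neural network", so the choice of a torchvision ResNet-50 and the frame shapes below are assumptions.

```python
# One possible visual feature extractor (assumed; the patent does not
# fix a backbone): per-frame ResNet-50 features with the classifier removed.
import torch
import torchvision.models as models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # keep up to global pool
backbone.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)   # 8 observed frames (toy input)
    feats = backbone(frames).flatten(1)    # (8, 2048) frame-level features x_t
```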
First, the intuition-based prediction module
In view of the abstract nature of intuition, the invention models the intuition-based prediction module as a black-box-like process consisting of an encoder, a decoder and a classifier. As shown in Fig. 3, the encoder takes as input the visual features of the observed video (i.e., the past information) and generates an abstract representation of that information (i.e., the intermediate variables); the decoder reads the intermediate variables passed by the encoder and outputs information about future behavior (i.e., the future information); and the classifier maps the decoder output to a concrete behavior category to obtain a concrete prediction result. For ease of understanding, the result predicted by the intuition-based prediction module is called the first prediction result. In psychology, intuition relies on the storage of implicit knowledge (knowledge that is difficult to explain in language); in the intuition-based prediction module, the storage of implicit knowledge corresponds to the update and optimization of the parameters. Because long short-term memory (LSTM) networks are good at storing temporal information, the invention implements the encoder and the decoder as two cascaded LSTM networks and the classifier as a fully-connected network.

According to one embodiment of the invention, the training process of this module is as follows: input the visual feature x_t at time t and output the behavior category at time t+1,

\hat{y}^I_{t+1} = G_{\theta_I}(x_t)

where \theta_I denotes all parameters of the LSTM encoder, the LSTM decoder and the fully-connected classifier. The module is optimized with a cross-entropy function, the loss being:

L_I = -y_{t+1} \log[G_{\theta_I}(x_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(x_t)]

where y_{t+1} is the true label of the behavior category at time t+1. The optimization objective is to minimize L_I, whose effect is to bring the intuitive prediction \hat{y}^I_{t+1} as close as possible to the true label y_{t+1}. In particular, visual features alone are not enough to store rich implicit knowledge; as shown in Fig. 3, the invention therefore adopts a hidden-knowledge storage strategy based on text pre-training: with everything else unchanged, the visual feature x_t is replaced by w_t, the word vector of y_t (the behavior preceding y_{t+1}). After the LSTM encoder, the LSTM decoder and the fully-connected classifier act on it, a prediction \hat{y}^I_{t+1} = G_{\theta_I}(w_t) is obtained, from which the loss

L_I' = -y_{t+1} \log[G_{\theta_I}(w_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(w_t)]

is calculated, and the parameters \theta_I are updated with the classical gradient descent algorithm of machine learning, realizing the parameter training of the intuition module. This strategy introduces the text modality on top of the visual modality; by using multi-modal information, richer implicit knowledge can be stored.
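A minimal sketch of this encoder-decoder-classifier structure, assuming PyTorch; the hidden size, the single-step decoding scheme and the softmax output are assumptions made for illustration.

```python
# Sketch of the intuition-based module: two cascaded LSTMs plus a
# fully-connected classifier (dimensions are hypothetical).
import torch
import torch.nn as nn

class IntuitionModule(nn.Module):
    def __init__(self, feat_dim, num_classes, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats):                # feats: (B, T, feat_dim)
        _, state = self.encoder(feats)       # encode past info -> intermediate variables
        h_last = state[0][-1].unsqueeze(1)   # (B, 1, hidden) seed for the decoder
        out, _ = self.decoder(h_last, state) # decode one future step
        return torch.softmax(self.classifier(out[:, -1]), dim=-1)  # \hat{y}^I_{t+1}
```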
Second, the analytical-reasoning-based prediction module
Considering that analytical reasoning is more interpretable and tends to process information according to given rules, the invention divides the analytical-reasoning-based prediction process into three steps: recognition, transition and combination. Still referring to Fig. 2: since first-view behavior is usually expressed as a (verb, noun) combination, the invention first recognizes a verb and a noun, e.g. "open" and "cupboard", from the input visual features; then transfers the verb and the noun to their next states according to conditional probabilities, e.g. the verb transitions from "open" to "take" with the highest probability and the noun transitions from "cupboard" to "cup" with the highest probability; and finally combines nouns and verbs into a new behavior while avoiding unreasonable combinations as far as possible. For example, ("take", "cupboard") and ("open", "cup") are unreasonable combinations while ("take", "cup") is reasonable, so prior knowledge can be used to raise the probability of the "take cup" combination and lower the probability of the others. As shown in Fig. 4, the training process of this module can be summarized as follows: input the visual feature x_t at time t; recognize the verb information v_t and the noun information n_t at time t with fully-connected networks; transfer v_t and n_t to time t+1 through the Markov transition matrices T_v and T_n,

v_{t+1} = T_v^T v_t, n_{t+1} = T_n^T n_t

and finally combine v_{t+1} and n_{t+1} into a new behavior using prior knowledge,

a_{t+1} = softmax[\eta(v, a) v_{t+1} + \eta(n, a) n_{t+1}]

where \eta(v, a) and \eta(n, a) denote the prior probability of the predicted action a given the verb v or the noun n, respectively. In the analytical-reasoning-based prediction module, the Markov transition matrices and the prior knowledge used in the combination step are obtained by statistics and contain no trainable parameters, so the parameters \theta_A of the fully-connected networks are the parameters of the module that need to be trained. According to one embodiment of the invention, as with the intuition-based prediction module, the analytical-reasoning-based prediction module is optimized with a cross-entropy function, the loss being:

L_A = -y_{t+1} \log[A_{\theta_A}(x_t)] - (1 - y_{t+1}) \log[1 - A_{\theta_A}(x_t)]

For the input visual feature x_t, after the fully-connected classifier, the Markov transition matrices and the prior knowledge are applied, a prediction \hat{y}^A_{t+1} = A_{\theta_A}(x_t) is obtained, from which the loss L_A is calculated, and the parameters \theta_A are updated with the classical gradient descent algorithm of machine learning, realizing the parameter training of the analytical reasoning module.
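The recognition/transition/combination pipeline can be sketched as follows, assuming PyTorch; in this sketch the transition matrices T_v, T_n and the priors \eta would be estimated by counting over the training annotations and are taken as given.

```python
# Sketch of the analytical-reasoning-based module. Only the two
# recognition layers are trainable; T_v, T_n and the priors are
# statistics registered as buffers (all shapes hypothetical).
import torch
import torch.nn as nn

class ReasoningModule(nn.Module):
    def __init__(self, feat_dim, n_verbs, n_nouns, n_actions,
                 T_v, T_n, eta_v, eta_n):
        super().__init__()
        self.verb_fc = nn.Linear(feat_dim, n_verbs)  # recognition: verbs
        self.noun_fc = nn.Linear(feat_dim, n_nouns)  # recognition: nouns
        self.register_buffer("T_v", T_v)      # (n_verbs, n_verbs) Markov matrix
        self.register_buffer("T_n", T_n)      # (n_nouns, n_nouns) Markov matrix
        self.register_buffer("eta_v", eta_v)  # (n_verbs, n_actions) prior eta(v, a)
        self.register_buffer("eta_n", eta_n)  # (n_nouns, n_actions) prior eta(n, a)

    def forward(self, feats):                 # feats: (B, feat_dim) pooled features
        v_t = torch.softmax(self.verb_fc(feats), dim=-1)
        n_t = torch.softmax(self.noun_fc(feats), dim=-1)
        v_next = v_t @ self.T_v               # v_{t+1} = T_v^T v_t, batched
        n_next = n_t @ self.T_n               # n_{t+1} = T_n^T n_t, batched
        logits = v_next @ self.eta_v + n_next @ self.eta_n  # prior combination
        return torch.softmax(logits, dim=-1)  # \hat{y}^A_{t+1} = a_{t+1}
```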
Third, the adaptive fusion module
Considering the complementarity of intuition and analytical reasoning, the invention designs an adaptive fusion module based on an attention mechanism, whose purpose is to organically fuse the intuition-based prediction result \hat{y}^I_{t+1} and the analytical-reasoning-based prediction result \hat{y}^A_{t+1}. As shown in Fig. 5, first \hat{y}^I_{t+1} and \hat{y}^A_{t+1} are each fed into a fully-connected network with trainable parameters (this network is part of the adaptive fusion module and is used to calculate the fusion weights), yielding the attention scores s_1 and s_2 corresponding to the intuition-based and the analytical-reasoning-based prediction result, respectively. Then s_1 and s_2 are normalized to obtain the weighting coefficients a_1 and a_2 (i.e., the fusion weights of the intuition-based and the analytical-reasoning-based prediction results), calculated as

[a_1, a_2] = softmax([s_1, s_2])

Finally, the two prediction results are multiplied by their corresponding weighting coefficients and summed to obtain the final prediction result

F_\theta(x_t) = a_1 \hat{y}^I_{t+1} + a_2 \hat{y}^A_{t+1}

where the parameters \theta of the fully-connected network are the parameters of the adaptive fusion module that need to be trained. As before, the adaptive fusion module is optimized with a cross-entropy function, the loss being:

L = -y_{t+1} \log[F_\theta(x_t)] - (1 - y_{t+1}) \log[1 - F_\theta(x_t)]

The inputs of the adaptive fusion module are the intuition-based prediction result \hat{y}^I_{t+1} and the analytical-reasoning-based prediction result \hat{y}^A_{t+1}; each passes through the same fully-connected network to obtain its fusion weight, a_1 and a_2 respectively, giving the final prediction result F_\theta(x_t); the loss L is calculated and the parameter \theta is updated with the classical gradient descent algorithm of machine learning, realizing the parameter training of the adaptive fusion module. During training, the fusion weights a_1 and a_2 are adjusted dynamically (through optimization of the cross-entropy loss: each time new \hat{y}^I_{t+1} and \hat{y}^A_{t+1} are fed into the fully-connected network, the parameter \theta is updated, and different parameters \theta produce different a_1 and a_2), so the whole framework can adaptively assign different fusion weights to intuition or to analytical reasoning, organically fusing the two kinds of prediction results and further improving prediction accuracy.
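A minimal sketch of this fusion step, assuming PyTorch; using a single shared linear scorer for both inputs is one reading of "the same fully-connected network" and is flagged here as an assumption.

```python
# Sketch of the attention-based adaptive fusion module: a shared FC
# scorer produces s_1, s_2; softmax gives the fusion weights a_1, a_2.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.scorer = nn.Linear(num_classes, 1)  # shared fully-connected scorer

    def forward(self, p_int, p_ana):             # (B, num_classes) each
        s = torch.cat([self.scorer(p_int), self.scorer(p_ana)], dim=1)  # (B, 2)
        a = torch.softmax(s, dim=1)              # [a_1, a_2] = softmax([s_1, s_2])
        return a[:, :1] * p_int + a[:, 1:] * p_ana  # F_theta(x_t)
```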
For a better understanding of the present invention, the invention is described below with reference to an example.
According to an example of the invention, as shown in Fig. 6, features are first extracted by a convolutional neural network; the video segment known to have occurred contains the information "opened the cupboard". The intuition-based prediction module, pre-trained on text with the information of the opened cupboard and of the next behavior, outputs a first prediction result such as "take cup, open cupboard, take bowl". The analytical-reasoning-based prediction module recognizes the verb information "open" and the noun information "cupboard", and transfers them based on prior probabilities: the next possible states of "open" are "take, close, put down", etc., and the next possible states of "cupboard" are "cup, cupboard, bowl", etc.; the next states of the verb and the noun are then combined based on prior probabilities, yielding a second prediction result such as "take cup, close cupboard, put down cup". The first and second prediction results are taken as input to the adaptive fusion module, whose fully-connected network calculates their respective weights; adaptive fusion based on these weights produces possible final predictions such as "take cup, take bowl, put down cup, close cupboard, put down bowl", each with a different probability depending on the fusion weights, and the one with the maximum probability is taken as the final behavior prediction result.
By organically combining intuition-based prediction with analytical-reasoning-based prediction, the invention effectively alleviates the "visual gap" problem, can predict a person's next action more directly from a psychological perspective, achieves higher accuracy, and provides more comprehensive support for practical engineering applications.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A first-view video behavior prediction system for performing behavior prediction based on an existing video, the system comprising:
the visual feature extraction module is used for extracting visual features in the existing video;
the intuition-based prediction module is used for performing intuition-based prediction according to the visual features extracted by the visual feature extraction module to obtain a first prediction result;
the prediction module based on the analytical reasoning is used for carrying out the prediction based on the analytical reasoning according to the visual characteristics extracted by the visual characteristic extraction module to obtain a second prediction result;
and the self-adaptive fusion module is used for organically fusing the first prediction result and the second prediction result by adopting an attention mechanism to obtain a final behavior prediction result.
2. The system according to claim 1, wherein the visual feature extraction module comprises a convolutional neural network.
3. The system according to claim 1, wherein the intuition-based prediction module comprises:
an encoder for generating an intermediate variable corresponding to an abstract representation of past information based on visual characteristics;
the decoder is used for reading information from the intermediate variables generated by the encoder to output future information about future behavior;
and the classifier is used for mapping the future information output by the decoder to a behavior class to obtain a first prediction result.
4. The system according to claim 3,
the encoder is a long short-term memory (LSTM) encoder, the decoder is an LSTM decoder, and the classifier is a fully-connected classifier.
5. The first-view video behavior prediction system of claim 1, wherein the analytical-reasoning-based prediction module comprises:
the recognition module is used for recognizing verbs and nouns in the visual features;
the transition module is used for transferring the verbs and nouns in the visual features recognized by the recognition module to their next states according to conditional probabilities;
and the combination module is used for combining the next states of the verb and the noun transferred by the transition module into a new behavior to obtain a second prediction result.
6. The first-view video behavior prediction system according to claim 5, wherein:
the identification module is configured as a fully connected classifier;
the transition module is configured as a Markov transition matrix;
the combining module employs a priori knowledge strategy.
7. The first-view video behavior prediction system according to claim 1, wherein the adaptive fusion module comprises: a fully-connected network for analyzing the first prediction result and the second prediction result to obtain a normalized weighting coefficient for the first prediction result and a normalized weighting coefficient for the second prediction result, respectively;
and the adaptive fusion module sums the products of the first prediction result and the second prediction result with their corresponding normalized weighting coefficients to obtain a final behavior prediction result.
8. A method for training the first-view video behavior prediction system according to any of claims 1-7, the method comprising the following three steps:
step 1) training an intuition-based prediction module, comprising:
inputting the visual features of the current moment into the intuition-based prediction module, outputting the behavior category of the next moment, and optimizing the intuition-based prediction module with a cross-entropy loss function, the loss function of the intuition-based prediction module being:

L_I = -y_{t+1} \log[G_{\theta_I}(x_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(x_t)]

where x_t is the visual feature at time t, \hat{y}^I_{t+1} = G_{\theta_I}(x_t) is the behavior category at time t+1 predicted by the intuition-based prediction module, \theta_I denotes all parameters of the intuition-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;

adopting a hidden-knowledge storage strategy based on text pre-training to store hidden knowledge for the intuition-based prediction module, comprising: inputting the word vector of the behavior preceding the next-moment behavior into the intuition-based prediction module, outputting a prediction result, and calculating the loss function

L_I' = -y_{t+1} \log[G_{\theta_I}(w_t)] - (1 - y_{t+1}) \log[1 - G_{\theta_I}(w_t)]

and then updating the parameters \theta_I of the intuition-based prediction module by gradient descent, where w_t is the word vector of y_t, the behavior preceding y_{t+1};
step 2) training a prediction module based on analytical reasoning, which comprises the following steps:
taking the visual features as input to the analytical-reasoning-based prediction module, optimizing it with a cross-entropy loss function, and updating its parameters by gradient descent, the loss function of the analytical-reasoning-based prediction module being:

L_A = -y_{t+1} \log[A_{\theta_A}(x_t)] - (1 - y_{t+1}) \log[1 - A_{\theta_A}(x_t)]

where x_t is the visual feature at time t, \hat{y}^A_{t+1} = A_{\theta_A}(x_t) is the behavior category at time t+1 predicted by the analytical-reasoning-based prediction module, \theta_A denotes the parameters of the analytical-reasoning-based prediction module, and y_{t+1} is the true label of the behavior category at time t+1;
step 3) training the adaptive fusion module, which comprises:
taking the prediction result of the intuition-based prediction module and the prediction result of the analytical-reasoning-based prediction module as input, optimizing the adaptive fusion module with a cross-entropy loss function, and updating the parameters of the adaptive fusion module by gradient descent, the loss function of the adaptive fusion module being:

L = -y_{t+1} \log[F_{\theta}(x_t)] - (1 - y_{t+1}) \log[1 - F_{\theta}(x_t)]

where x_t is the visual feature at time t, F_{\theta}(x_t) is the behavior category at time t+1 predicted by the adaptive fusion module, \theta denotes the parameters of the adaptive fusion module, and y_{t+1} is the true label of the behavior category at time t+1;

F_{\theta}(x_t) = a_1 \hat{y}^I_{t+1} + a_2 \hat{y}^A_{t+1}

where a_1 is the fusion weight of the prediction result output by the intuition-based prediction module and a_2 is the fusion weight of the prediction result output by the analytical-reasoning-based prediction module;
wherein step 1) and step 2) have no fixed order.
9. A method for performing first-view video behavior prediction using the first-view video behavior prediction system of any of claims 1-7, comprising:
S1, acquiring an observed first-view video and extracting visual features from the video;
S2, performing intuition-based behavior prediction on the visual features extracted in step S1 to obtain a first prediction result, and simultaneously performing analytical-reasoning-based behavior prediction on the visual features extracted in step S1 to obtain a second prediction result;
and S3, calculating fusion weights for the first prediction result and the second prediction result, and fusing the two results based on the fusion weights to obtain a final behavior prediction result.
10. A computer-readable storage medium, having embodied thereon a computer program, the computer program being executable by a processor to perform the steps of the method of claim 9.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the steps of the method as claimed in claim 9.
CN202011059356.4A 2020-09-30 2020-09-30 First-view video behavior prediction system and method Pending CN112183391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011059356.4A CN112183391A (en) 2020-09-30 2020-09-30 First-view video behavior prediction system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011059356.4A CN112183391A (en) 2020-09-30 2020-09-30 First-view video behavior prediction system and method

Publications (1)

Publication Number Publication Date
CN112183391A 2021-01-05

Family

ID=73947082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011059356.4A Pending CN112183391A (en) 2020-09-30 2020-09-30 First-view video behavior prediction system and method

Country Status (1)

Country Link
CN (1) CN112183391A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170289624A1 (en) * 2016-04-01 2017-10-05 Samsung Electrônica da Amazônia Ltda. Multimodal and real-time method for filtering sensitive media
CN110574046A (en) * 2017-05-19 2019-12-13 渊慧科技有限公司 Data efficient emulation of various behaviors
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN110991290A (en) * 2019-11-26 2020-04-10 西安电子科技大学 Video description method based on semantic guidance and memory mechanism
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘昊淼: "Visual Representation Learning for Object Semantic Understanding", Wanfang Data *
刘毅志 et al.: "*** Video Detection Fusing Audio Words and Visual Features", Journal of Image and Graphics *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327268A (en) * 2021-05-26 2021-08-31 中国科学院计算技术研究所 Self-constrained video activity prediction method and system
CN113705402A (en) * 2021-08-18 2021-11-26 中国科学院自动化研究所 Video behavior prediction method, system, electronic device and storage medium

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 20210105