CN115719510A - Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning - Google Patents


Info

Publication number
CN115719510A
Authority
CN
China
Prior art keywords
character
feature
group
fusion
modal
Prior art date
Legal status
Pending
Application number
CN202211365228.1A
Other languages
Chinese (zh)
Inventor
邓海刚
刘斯凡
李成伟
邹风山
王传旭
Current Assignee
Harbin Institute of Technology
Qingdao University of Science and Technology
Original Assignee
Harbin Institute of Technology
Qingdao University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology, Qingdao University of Science and Technology filed Critical Harbin Institute of Technology
Priority to CN202211365228.1A
Publication of CN115719510A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a group behavior recognition method based on multi-modal fusion and implicit interactive relation learning. The method extracts static pose features and dynamic optical flow features of the persons in a scene; performs multi-modal feature fusion to obtain a feature representation containing both optical flow and pose information; learns member interaction relations on the fused feature vectors, so that the persons important for behavior recognition are selectively extracted and the relations among persons are better modeled; then extracts global feature information containing background information; and finally uses the differences in scene-level information within the global field of view to assist the person-level features in jointly recognizing the group behavior. By extracting the individual behaviors in the group and modeling and reasoning about the interaction relations among group members, the scheme achieves the goal of predicting the group behavior with high recognition accuracy, and has strong practical application and popularization value in the field of group behavior recognition.

Description

Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning
Technical Field
The invention belongs to the field of group behavior recognition, and particularly relates to a group behavior recognition method based on adaptive multi-modal fusion and implicit interactive relation learning.
Background
In recent years, human behavior recognition in videos has attracted wide attention in the field of computer vision. It is widely applied in real life, for example in intelligent video surveillance, abnormal event detection, sports analysis and the understanding of social behaviors, so group behavior recognition has important scientific practicality and great economic value.
"A Multi-Stream Convolutional Neural Network Framework for Group Activity Recognition" published by Sina Mokhtarzadeh Azar et al. (document 1) proposes a multi-stream convolutional neural network framework for group behavior recognition, which trains different CNN streams separately and finally performs decision-level fusion to predict the final group behavior. In addition, "Outgoing Relationship Network by Set-Extension Managed Conditional Random Fields for Group Activity Recognition" (document 2) and "Deep Bilinear Learning for RGB-D Action Recognition" published by Hu et al. (document 3) also disclose corresponding group behavior recognition methods. However, although the algorithm framework of document 1 combines multi-modal feature representations and enriches the extracted information, it only adopts late decision-level fusion and does not consider fusion at the feature level, which inevitably causes redundancy among the feature information of the different modalities. Document 2 aggregates pose, spatial position and appearance features by simple concatenation in the member feature extraction stage, but such simple operations generally have few or no learnable associated parameters, so the interaction of complementary information and the saliency of each modality cannot be learned. Document 3 fuses skeleton and RGB features by constructing tensor structure cubes to combine person motion features, so that the modal features can be learned complementarily, but a huge amount of computation cannot be avoided.
In recent years, for the member interaction relationship inference part, "Learning Actor Relation Graphs for Group Activity Recognition" published by Wu et al. and "Convolutional Relational Machine for Group Activity Recognition" published by Azar et al. establish a person relationship topology using graph convolutional networks and intermediate representations such as graph structures, and capture the appearance and positional relations between members to perform relation reasoning. Although these methods achieve good prediction results, in order to extract spatial features such explicit modeling methods require explicit person position nodes to establish a topological graph structure, and they use convolutional neural networks as the basic building block, computing hidden representations and output positions of all inputs in parallel and associating signals from two arbitrary input or output positions; the number of required operations grows with the distance between positions, and the multiple iterations make the network computation complex and heavy.
Disclosure of Invention
Aiming at the problem in the prior art that the feature information of multi-modal individual members in group behavior recognition is redundant and its salient content is difficult to highlight because only simple operations such as concatenation and addition are performed, the invention provides a group behavior recognition method based on adaptive multi-modal fusion and implicit interaction learning, which aims to accurately recognize the behavior of each individual in a group and to infer the group behavior from the individuals and the interaction features among them.
The invention is realized by adopting the following technical scheme: a group behavior recognition method based on multi-modal fusion and implicit interactive relation learning comprises the following steps:
step A, dynamic and static dual-stream character feature extraction: extracting the static pose features and the dynamic optical flow features of each person based on a character-level feature extraction module;
step B, multi-modal feature fusion: concatenating the unimodal static pose feature and dynamic optical flow feature and compressing them by convolution to obtain a latent vector of the salient information, thereby obtaining a fused feature representation that contains the most representative information of each modality, both optical flow and fine pose;
step C, member interaction relation learning: using the fused feature representation obtained in step B and based on a self-attention mechanism, computing the pairwise appearance similarity of the person features through their association strength, selectively extracting the persons important for behavior recognition, and obtaining an implicit vector representation among the group members computed in the form of an attention-weighted sum;
step D, global feature extraction: based on a global feature extraction module, extracting global feature information containing background information for the input video frames;
and step E, recognizing the group behavior based on step C and step D.
Further, the step B specifically includes the following steps:
step B1, first concatenating the unimodal features and reducing the number of channels through the encoder convolution network to obtain a self-fused latent vector;
step B2, reconstructing the initially concatenated vector from the self-fused latent vector;
and step B3, minimizing the Euclidean distance between the original and the reconstructed concatenated vectors, and taking the intermediate vector as the fused multi-modal feature representation.
Further, the step B1 is specifically realized by the following steps:
(1) Embedding the static pose features and the dynamic optical flow features of each person obtained by the character-level feature extraction module into vector spaces of the same dimensionality through an Embedding linear mapping, and taking them respectively as the unimodal inputs;
(2) Given n d-dimensional multi-modal latent vectors $\{h_i\}_{i=1}^{n}$, $h_i \in \mathbb{R}^{d}$, with n less than or equal to 3, where the two modalities denote the static pose feature vector and the dynamic optical flow feature vector of a person, a concatenation operation is first performed to obtain $h_{cat}=[h_1;\dots;h_n]$, where $h_{cat}\in\mathbb{R}^{nd}$;
(3) The encoding part then maps $h_{cat}$ to a latent vector and reduces its dimensionality to t:
in the encoding part, a Linear layer first compresses the concatenated multi-modal vector $h_{cat}$ to the dimension of the unimodal initialization, followed by a nonlinear mapping through a Tanh activation function; a second Linear transformation then further compresses the feature dimension and is followed by a ReLU activation, and the resulting vector $h_{z}\in\mathbb{R}^{t}$ is referred to as the fused latent feature.
Further, the step B2 is specifically realized in the following manner:
the decoding (transformation) part reconstructs the initially concatenated vector from the fused latent feature $h_{z}$, yielding $\hat{h}_{cat}$; the loss $F_{tr}$ between $h_{cat}$ and $\hat{h}_{cat}$ is computed to guide the iterative optimization of the network, so that the learned latent feature representation best captures the salient information of each modality.
Further, in step B3, an MSE loss function is adopted to guide the learning of the fusion network, and the intermediate vector $h_{z}$ is taken as the fused multi-modal feature representation.
Further, the step C specifically includes the following stages:
in the first stage, the score of the degree of association between each person and the other participants is calculated by matching the query Q with the key set K; all three representations (Q, K, V) are computed from the input sequence S through linear projections, where S = {s_i | i = 1, …, N} is the group of person features obtained by the character feature extractor after multi-modal adaptive fusion, and A(S) = A(Q(S), K(S), V(S));
in the second stage, the association-degree results obtained from the dot product of the query Q and K are normalized, giving a similarity set a_n, n = 1, 2, …, N over the N persons, whose elements sum to 1;
in the third stage, the similarity vectors obtained by the normalization of the first two stages are multiplied by V to obtain the final weighted attention matrix, which is used for the final classification and recognition.
Further, in step D, I3D is used as the backbone network with the RGB video clip as input; T frames centered on the annotated frame are selected, and the deep spatio-temporal feature map extracted from the final convolutional layer is used as a rich semantic representation describing the entire video clip.
Further, in step E, the output of step D is fed to a fully connected layer, the output of which is concatenated with the output of the member interaction learning module of step C and passed as input to a fully connected layer with a classification layer, so that the differences in scene-level information within the global field of view assist the person-level features in jointly recognizing the group behavior;
during recognition and classification, two classifiers are set to generate the group behavior category scores and the individual action category scores respectively, the final recognition and classification results of the group behavior and the individual actions are obtained through network learning, and cross-entropy loss is selected to guide the optimization during the whole network training process.
Furthermore, the input video frames are divided into three branches: a static branch, a dynamic branch and a global branch; the static branch and the dynamic branch are connected to the character-level feature extraction module, and the global branch is connected to the global feature extraction module. In step A, a dual-stream feature extraction scheme is adopted to enrich the individual features, specifically including:
(1) The static branch backbone network extracts the static pose features of each person: the bounding box surrounding a person in the video frame is taken as the input of the pose estimation model HRNet, and the key joint points are predicted;
(2) The dynamic branch backbone network extracts dynamic optical flow features: adopting the optical flow representation in the I3D network, the input sequence frames are first converted into consecutive optical flow frames, the stacked frame sequence is then processed by an inflated 3D convolution network, and a RoIAlign layer is added to project coordinates onto the feature map of the frame, so that the optical flow features of each person bounding box in the input frame are extracted.
Compared with the prior art, the invention has the following advantages and positive effects:
firstly, the method achieves the goal of predicting the group behavior by extracting the individual behaviors in the group and modeling and reasoning about the interaction relations among the group members. A dynamic and static multi-modal feature extraction module (the character-level feature extraction module) is designed; compared with the prior art, it not only extracts the dynamic person features represented by optical flow in the 3D CNN, but also, considering that the positions and motions of the body joints of most people in a group behavior are highly correlated, extracts the member pose features through HRNet to refine the fine-grained representation of person actions, and enriches the person feature information through the fusion of the dual-stream representation of 2D pose features and 3D optical flow dynamic features;
secondly, in the multi-modal feature fusion stage, a lightweight component design is adopted and an adaptive multi-modal fusion module is used to learn the collaboratively shared representation of each modality, effectively combining the multi-modal data; at the same time the lightweight design avoids the huge amount of computation in previous work, and compared with the simple addition and concatenation fusion methods, the average recognition accuracy in experiments on the Volleyball dataset is improved by 4.6% and 3.7% respectively;
thirdly, the member interaction relationship reasoning module uses the self-attention mechanism of the Transformer network from NLP tasks, which models the dependency relations among words well and has the advantage of requiring no recurrence or repeated iteration. From the perspective of model extensibility, a standard Transformer encoder is applied directly to the interaction reasoning module for group behavior recognition to act on participant interactions, without any change specific to the visual task, and its effectiveness in the vision field is exploited; experimental results show that, compared with models requiring explicit spatial and temporal constraints, the method models the relations among persons better and performs group behavior recognition by combining person-level information;
fourthly, the global feature extraction module alleviates the problem of confusing similar group behaviors and reduces recognition errors. For example, the two behavior categories "walking" and "crossing" in the Collective Activity Dataset (CAD) usually occur in the open air and the underlying individual action is essentially walking, so the group behavior categories are difficult to distinguish from individual motion features alone and are often confused. However, crossing usually occurs at places with salient scene information such as zebra crossings and traffic lights at intersections, while the global scene of walking is mostly park paths and the like. Therefore, a global feature extraction module is designed to assist recognition through the scene context features of different behaviors, and experiments show that adding this module improves the average recognition accuracy by 2.3% in an ablation experiment on CAD.
Drawings
Fig. 1 is a schematic flowchart of a group behavior identification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an adaptive multi-modal feature fusion module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a member interaction relationship reasoning module according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an attention mechanism according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more clearly understood, the present invention will be further described with reference to the accompanying drawings and examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as described herein, and thus, the present invention is not limited to the specific embodiments disclosed below.
The embodiment provides a group behavior recognition method based on adaptive multi-modal fusion and implicit interactive relationship learning; a schematic diagram of its principle framework is shown in fig. 1. The designed framework comprises a character-level feature extraction module, a multi-modal feature fusion module, a member interaction relationship learning module and a global feature extraction module, and the method specifically comprises the following steps:
step A, dynamic and static dual-stream character feature extraction: extracting the static pose features and the dynamic optical flow features of each person based on the character-level feature extraction module;
step B, multi-modal feature fusion: concatenating the unimodal static pose feature and dynamic optical flow feature, compressing them by convolution to obtain a latent vector of the salient information, and strengthening the salient features by iterative computation guided by a loss function between the latent vector and the original feature information, so as to obtain a fused feature representation containing the most representative information of each modality, both optical flow and fine pose, while removing redundant and useless feature information that may exist in the two modalities, thereby reducing the dimensionality of the fused feature vector and lowering the computational cost of the downstream interaction-relation modeling;
step C, member interaction relation learning: based on the member feature vectors with fused multi-modal information obtained in step B, computing the appearance similarity of the person features through their association strength with the self-attention mechanism of a Transformer encoder network; the assigned similarity scores spontaneously discover the relations among persons, so this process can be regarded as implicit interaction relation learning; the persons important for behavior recognition are selectively extracted, the relations among persons are better modeled, and an implicit vector representation among the group members computed in the form of an attention-weighted sum is obtained;
step D, global feature extraction: based on the global feature extraction module, extracting global feature information containing background information through a 3D CNN;
and step E, based on step C and step D, feeding the output of step D to a fully connected layer with a number of neurons and a tanh activation function, concatenating the output of this layer with the Transformer encoder output of step C, and passing it as input to a fully connected layer with a Softmax classification layer, so that the differences in scene-level information within the global field of view assist the person-level features in jointly recognizing the group behavior.
The technical solution of the present invention is described in detail below with reference to the specific embodiments:
1. Dynamic and static dual-stream character feature extraction
In this embodiment, a character-level feature extraction module is designed and the individual features are enriched by dual-stream feature extraction. As shown in fig. 1, the input video frames are divided into three branches: a static branch, a dynamic branch and a global branch; the static branch and the dynamic branch are connected to the character-level feature extraction module, and the global branch is connected to the global feature extraction module:
(1) The static branch backbone network extracts 2D person pose features:
Human motions in videos all involve the movement of body joints such as hands, arms and legs, and human pose prediction is suitable not only for the fine-grained action recognition required in sports activities (such as catching and hitting the ball in the Volleyball dataset) but also for daily actions (such as walking and talking in CAD). The person features in a video need to capture not only the joint positions but also the temporal dynamics of the joints. To obtain the joint positions of each person, this embodiment adopts the pose estimation model HRNet, which accepts the bounding box surrounding a person in the video frame as input and predicts the keypoint locations. In the experiments, the feature map of the last layer of the network can be used, and the smallest network pose_hrnet_w32 trained on the COCO keypoints achieves good performance.
(2) The dynamic branch backbone network is responsible for modeling 3D temporal dynamics:
Studies have shown that a 3D CNN trained with enough available data can construct a strong spatio-temporal representation for action recognition. Because the separate pose network in the static branch cannot capture joint motion from a single frame, this embodiment employs an I3D network with its optical flow representation: the input sequence frames are first converted into consecutive optical flow frames, and the stacked frame sequence F_t, t = 1, …, T is then processed by the inflated 3D convolution network. Because 3D CNNs are computationally very expensive, this embodiment adds a RoIAlign layer to project coordinates onto the feature map of the frame and extracts the optical flow features of each person bounding box in the input frame.
In this embodiment, a 2D pose network and a 3D CNN backbone are used to extract person features from consecutive video frames. The designed character-level feature extraction module not only extracts the dynamic person features represented by optical flow in the 3D CNN, but also, considering that the positions and motions of the body joints of most people in a group behavior are highly correlated, extracts the member pose features through HRNet to refine the fine-grained representation of person actions, and expresses rich static and dynamic person feature information through the fusion of the dual-stream representation of 2D pose features and 3D optical flow dynamic features.
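As a concrete illustration of the dual-stream extraction described above, the following PyTorch sketch pools per-person optical-flow features with RoIAlign and embeds both streams into a common dimension; the backbone outputs, the tensor shapes and the temporal averaging step are assumptions for illustration, not the exact configuration of this embodiment.

```python
# Hedged sketch (not the exact configuration of the embodiment): per-person pooling of
# the two streams, assuming the HRNet pose backbone and the optical-flow 3D CNN are
# available elsewhere and already produced the feature maps used below.
import torch
from torchvision.ops import roi_align


def person_features(pose_feat, flow_feat, boxes, d=256):
    """pose_feat: (N, C_p) pose descriptor per person box (e.g. pooled HRNet output).
    flow_feat: (1, C_f, T, H, W) spatio-temporal map from the 3D CNN on optical flow.
    boxes:     (N, 4) person boxes in feature-map coordinates (x1, y1, x2, y2).
    Returns two (N, d) unimodal embeddings ready for the fusion module."""
    # Average over the temporal axis so RoIAlign can crop from a 2D map --
    # one simple choice, not necessarily the one used in the embodiment.
    flow_2d = flow_feat.mean(dim=2)                               # (1, C_f, H, W)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index 0
    flow_crop = roi_align(flow_2d, rois, output_size=(5, 5))      # (N, C_f, 5, 5)
    flow_vec = flow_crop.flatten(1)                               # (N, C_f * 25)
    # The linear embeddings are created here only for illustration; in practice
    # they would be trainable parameters of a module.
    embed_pose = torch.nn.Linear(pose_feat.size(1), d)
    embed_flow = torch.nn.Linear(flow_vec.size(1), d)
    return embed_pose(pose_feat), embed_flow(flow_vec)


# Toy usage with random tensors standing in for the backbone outputs.
pose = torch.randn(12, 512)               # 12 people, 512-d pose descriptors (assumed)
flow = torch.randn(1, 832, 8, 28, 50)     # assumed shape of the optical-flow feature map
boxes = torch.rand(12, 4) * 10
boxes[:, 2:] += boxes[:, :2] + 1.0        # ensure x2 > x1 and y2 > y1
p_emb, f_emb = person_features(pose, flow, boxes)
print(p_emb.shape, f_emb.shape)           # torch.Size([12, 256]) twice
```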
2. Multimodal feature fusion
Most existing fusion techniques, such as concatenation and TFN, involve deterministic operations to construct a joint multi-modal representation. For example, TFN uses a 3-fold Cartesian product of the unimodal features for prediction, which focuses more on learning rich unimodal features and results in inefficient use of the multi-modal information.
This embodiment designs a lightweight adaptive fusion technique whose basic principle is as follows: the concatenation of the unimodal embeddings is used only as an initial step rather than as the final fusion result, and the most representative key information of the multiple modalities is extracted by reconstructing the input information, which effectively combines the multi-modal inputs and alleviates the staticness, shallow expressiveness and cost problems of existing fusion methods.
As shown in fig. 2, the adaptive multi-modal feature fusion module extracts multi-modal features by maximizing the correlation between the multi-modal inputs. An encoder and a decoder are designed; by learning to "compress" the information of the different modalities, the original data are recovered from a vector that carries a small amount of information but contains all of the key information, back-propagation is performed on the prediction error, and the accuracy is gradually improved. The module specifically includes the following steps:
1. First, the unimodal features are concatenated, and the number of channels is reduced through the convolution network of the encoder to compress the vector dimensionality and obtain a self-fused latent vector;
(1) The static pose features and the dynamic optical flow features of each person obtained by the character-level feature extraction module are embedded into vector spaces of the same dimensionality through an Embedding linear mapping operation and taken respectively as the unimodal inputs; in the specific implementation the dimensionality is set to d = 256;
(2) Given n (n ≤ 3) d-dimensional multi-modal latent vectors $\{h_i\}_{i=1}^{n}$, $h_i \in \mathbb{R}^{d}$, where the two modalities denote the static pose feature vector and the dynamic optical flow feature vector, a concatenation operation is first performed to obtain $h_{cat}=[h_1;\dots;h_n]$, where $h_{cat}\in\mathbb{R}^{nd}$; the concatenated dimension becomes 512;
(3) The encoding part T then maps $h_{cat}$ to a latent vector and reduces its dimensionality to t:
in the encoding part, a Linear layer first compresses the dimension of the concatenated multi-modal vector $h_{cat}$ to the dimension d = 256 of the unimodal initialization, followed by a nonlinear mapping through a Tanh activation function; a second Linear transformation then compresses the feature dimension to t = 128 and is followed by a ReLU activation. The resulting vector $h_{z}\in\mathbb{R}^{t}$ is referred to as the fused latent feature (the most representative feature information representation). By limiting the dimensionality reduction, learning this imperfect fused representation forces the encoder to capture the most salient features in the training data while discarding useless redundant information.
2. Then the initially concatenated vector is reconstructed from the self-fused latent vector;
The specific operation requires that the input signal be reproduced as faithfully as possible; to achieve this, the network must automatically capture the most important factors that characterize the input data. In this embodiment, the decoding (transformation) part F reconstructs the initially concatenated vector from the fused latent feature $h_{z}$, yielding $\hat{h}_{cat}$; the loss $F_{tr}$ between $h_{cat}$ and $\hat{h}_{cat}$ is computed to guide the iterative optimization of the network, so that the learned latent feature representation best captures the salient information of each modality.
In the F transformation, the latent feature vector $h_{z}$ is first mapped up to dimension 256 through a Linear transformation followed by a ReLU activation function; a second Linear transformation then restores the originally concatenated dimension of 512, producing a reconstruction $\hat{h}_{cat}$ that is close to $h_{cat}$; here $\hat{h}_{cat}$ and $h_{cat}$ are not identical.
3. The Euclidean distance between the original concatenated vector and the reconstructed concatenated vector is minimized; training the model in this way encourages it to compress the information without losing any essential cues, which effectively increases the correlation between the self-fused latent vector and the concatenated vector. Specifically:
learning of the fusion network is guided by a loss function. In this embodiment a Mean Squared Error (MSE) loss function is adopted; a smaller loss value indicates a better reconstruction and, equivalently, that the compressed fused multi-modal latent feature $h_{z}$ captures the adaptively learned multi-modal information more saliently. The MSE loss is as follows:
$\mathcal{L}_{MSE}=\frac{1}{nd}\sum_{j=1}^{nd}\left(h_{cat,j}-\hat{h}_{cat,j}\right)^{2}$
For the adaptive multi-modal fusion network, the intermediate vector $h_{z}$ is taken as the fused multi-modal representation.
This embodiment thus provides a lightweight multi-modal fusion strategy: through encoding (encoder) and decoding (decoder) it learns to compress the information of the different modalities, computes the loss between the initial features and their learned reproduction, adaptively captures the effective information between the 2D and 3D dual-stream cross-modal person features, and alleviates information redundancy.
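A minimal PyTorch sketch of the adaptive fusion autoencoder described in this section, using the dimensions stated above (d = 256 per modality, 512 after concatenation, latent t = 128); the module and variable names are illustrative assumptions.

```python
# Minimal sketch of the adaptive fusion autoencoder, using the dimensions given in this
# embodiment (d = 256 per modality, 512 after concatenation, latent t = 128); module and
# variable names are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    def __init__(self, d=256, t=128, n_modal=2):
        super().__init__()
        cat_dim = n_modal * d
        # Encoder T: Linear -> Tanh -> Linear -> ReLU, compressing 512 -> 256 -> 128.
        self.encoder = nn.Sequential(nn.Linear(cat_dim, d), nn.Tanh(),
                                     nn.Linear(d, t), nn.ReLU())
        # Decoder F: Linear -> ReLU -> Linear, restoring 128 -> 256 -> 512.
        self.decoder = nn.Sequential(nn.Linear(t, d), nn.ReLU(),
                                     nn.Linear(d, cat_dim))

    def forward(self, pose_emb, flow_emb):
        h_cat = torch.cat([pose_emb, flow_emb], dim=-1)   # (N, 512) concatenated input
        h_z = self.encoder(h_cat)                         # (N, 128) fused latent feature
        h_rec = self.decoder(h_z)                         # reconstruction of h_cat
        rec_loss = nn.functional.mse_loss(h_rec, h_cat)   # MSE that guides the fusion
        return h_z, rec_loss


# h_z is passed on to the interaction-relation module; rec_loss is added to the total
# training loss so the latent vector keeps the salient cross-modal cues.
fusion = AdaptiveFusion()
h_z, rec_loss = fusion(torch.randn(12, 256), torch.randn(12, 256))
print(h_z.shape, rec_loss.item())
```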
3. Member interaction learning
In the member interaction relationship learning module, a Transformer encoder architecture is applied to the challenging task of group behavior recognition in videos to learn, refine and aggregate the relational features at the person level, as shown in fig. 3. The encoder processes the input sequence with a stack of identical layers, each consisting of a multi-head self-attention layer and a fully connected feed-forward network. The self-attention mechanism is an important component of the Transformer encoder network; in sequence modeling in the natural language field it attaches more importance to the most relevant words in the source sequence, and when extended to group behavior recognition it can be used to discover the appearance similarity relations between relevant persons in a group behavior, enhancing the feature information of each participant based on the other participants in the video so that the key persons in the group behavior receive more attention, and the whole process requires no spatial constraint.
In this embodiment, self-attention is denoted as A, a function representing a weighted sum of the values V; its process of learning the group member relations is shown in fig. 4:
First, in the first stage, the score of the degree of association between each person and the other participants is calculated by matching the query Q with the key set K. All three representations (Q, K and V) are computed from the input sequence S through linear projections, where S = {s_i | i = 1, …, N} is the group of person features obtained by the character feature extractor after multi-modal adaptive fusion, and A(S) = A(Q(S), K(S), V(S)). Q and each element of the key set K represent participant feature vectors, and F is the matching function (fig. 4) that calculates the pairwise appearance similarity of the person features. This function can take different forms; this embodiment uses the dot product, which represents the projection length of one vector onto another and reflects the similarity between the two vectors: the higher the similarity and relevance between person feature vectors, the higher the computed score. In practice, dot-product attention is faster and more space-efficient, since it can be implemented with highly optimized matrix multiplication code. Formally, attention with the dot-product matching function can be written as:
$A(S)=\operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$
where d is the dimension of the query Q and the keys K.
In the second stage, Softmax normalization is applied to the association-degree results obtained from the dot product of Q and K, giving a similarity set a_n, n = 1, 2, …, N over the N persons, whose elements sum to 1. The purpose is to map the learned association degree of each person with respect to the remaining participants into the range 0–1, expressing the strength of the association between that person and the other participants in the form of a probability.
In the third stage, the similarity vectors obtained by the Softmax of the previous two stages are multiplied by V to obtain the final weighted attention matrix, which can be regarded as the interaction relations among the group members learned implicitly through the self-attention mechanism and is used for the final classification and recognition.
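The three stages above can be summarized by the following PyTorch sketch of scaled dot-product self-attention over the fused member features; the projection dimension is an illustrative assumption.

```python
# Sketch of the three stages over the fused member features S = {s_i}: (1) dot-product
# matching of Q against K, (2) softmax normalization, (3) weighting of V. The projection
# dimension is an illustrative assumption.
import math
import torch
import torch.nn as nn


class MemberSelfAttention(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.dim = dim

    def forward(self, s):                              # s: (N, dim) fused person features
        q, k, v = self.q(s), self.k(s), self.v(s)
        scores = q @ k.t() / math.sqrt(self.dim)       # stage 1: pairwise relevance scores
        a = torch.softmax(scores, dim=-1)              # stage 2: each row sums to 1
        return a @ v                                   # stage 3: weighted sum A(S)


attn = MemberSelfAttention()
refined = attn(torch.randn(12, 128))                   # (12, 128) relation-refined features
```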
In the present embodiment, since the features s_i do not follow any particular order, the self-attention mechanism is better suited than RNNs and CNNs for refining and aggregating these features. The Transformer encoder can implicitly exploit the spatial relations between people through the position encoding of s_i, alleviating the problems of explicit modeling by relying only on the self-attention mechanism. The model uses the center point (x_i, y_i) of each bounding box b_i of the individual person feature s_i, and the center point is encoded using the same position encoding function PE as in the paper "Attention Is All You Need".
To sum up, for the problems of inaccurately describing long-range member interactions and of finding the key persons, this embodiment provides an implicit interaction relationship reasoning module that needs no explicit spatial or temporal modeling. Using the self-attention mechanism of the Transformer encoder, which relates different positions of a single sequence to compute a representation of that sequence, the interactions among group members are learned from the association strength, i.e., from the pairwise appearance similarity of the person feature vectors together with the participant spatial structure information captured by the position encoding. The assigned similarity scores selectively extract the key person information important for behavior recognition, and the dependency relations among persons are simulated and inferred, so that the module selectively focuses on the participants carrying important information for behavior recognition while implicitly modeling the appearance and position relations among persons without relying on any prior spatial or temporal structure.
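A hedged sketch of how an unmodified Transformer encoder can be applied to the fused member features, with a sinusoidal encoding of the box centers (x_i, y_i) added beforehand; the number of layers and heads and the encoding scale are assumptions.

```python
# Sketch: an unmodified Transformer encoder over the fused member features, with a
# sinusoidal encoding of the box centres (x_i, y_i) added beforehand. Layer count,
# head count and the encoding scale are assumptions.
import torch
import torch.nn as nn


def centre_pe(centres, dim=128, scale=100.0):
    """Sinusoidal encoding of (N, 2) box centres into (N, dim): first half for x, second for y."""
    half = dim // 2
    freqs = torch.exp(torch.arange(0.0, half, 2.0) * (-torch.log(torch.tensor(10000.0)) / half))
    pe = torch.zeros(centres.size(0), dim)
    for axis in (0, 1):                                   # encode x, then y
        angles = centres[:, axis:axis + 1] * scale * freqs
        pe[:, axis * half:axis * half + half:2] = torch.sin(angles)
        pe[:, axis * half + 1:axis * half + half:2] = torch.cos(angles)
    return pe


layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, dim_feedforward=256, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

fused = torch.randn(12, 128)                              # fused multi-modal member features
centres = torch.rand(12, 2)                               # normalised box centres (x_i, y_i)
tokens = (fused + centre_pe(centres)).unsqueeze(0)        # add PE and a batch dimension
relations = encoder(tokens).squeeze(0)                    # (12, 128) implicit relation features
```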
4. Global feature extraction
The global branch is similar to the dynamic feature extraction described in the feature extraction section, except that no RoIAlign module is used during global feature extraction: in the dynamic and static dual-stream character feature extraction, RoIAlign uses the person bounding boxes provided in the dataset so that only the person features are attended to, similar to cropping, whereas the global feature is the overall feature of the data including the background information.
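A minimal sketch of the global branch; torchvision's r3d_18 is used here only as a stand-in for the I3D backbone named in the text, and the clip size is illustrative.

```python
# Minimal sketch of the global branch: a 3D CNN over the raw RGB clip with no RoIAlign,
# keeping the final convolutional feature map as the scene-level descriptor. torchvision's
# r3d_18 is used only as a stand-in for the I3D backbone named in the text.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

backbone = r3d_18()                                       # stand-in 3D CNN, untrained here
trunk = nn.Sequential(*list(backbone.children())[:-2])    # drop the avgpool and fc head

clip = torch.randn(1, 3, 16, 112, 112)                    # T = 16 frames around the annotated frame
feat_map = trunk(clip)                                    # (1, 512, T', H', W') final conv features
global_feat = feat_map.mean(dim=(2, 3, 4))                # (1, 512) scene-level feature
print(global_feat.shape)
```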
5. Group behavior identification:
to integrate the models (character-level model and global scene model), the output of the global feature extraction module is fed to a fully connected layer with a number of neurons and tanh activation functions, the output of which is connected to the transform output and passed as input to the fully connected layer with the Softmax classification layer for recognition.
In this embodiment, two classifiers are set during recognition and classification to generate the group behavior category score and the individual action category score, the final recognition and classification results of the group behavior and the individual actions are obtained through network learning, and cross-entropy loss is selected to guide the optimization during the whole network training process:
$\mathcal{L}=\mathcal{L}_{1}\left(y^{g},\hat{y}^{g}\right)+\lambda\,\mathcal{L}_{2}\left(y^{a},\hat{y}^{a}\right)$
where $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ denote the cross-entropy losses, $\hat{y}^{g}$ and $\hat{y}^{a}$ are the group behavior score and the individual action score, $y^{g}$ and $y^{a}$ are the ground-truth labels of the target group behavior and the individual actions, and λ is a hyper-parameter that balances the two terms.
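A hedged sketch of the recognition head and the combined loss above; the feature dimensions, the member pooling and the class counts are assumptions for illustration.

```python
# Sketch of the recognition head: the scene feature passes through a fully connected
# layer with Tanh, is concatenated with the pooled relation-refined person features, and
# two classifiers produce the group and individual scores trained with the combined
# cross-entropy loss above. Dimensions, pooling and class counts are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecognitionHead(nn.Module):
    def __init__(self, person_dim=128, scene_dim=512, hidden=128,
                 n_group_classes=8, n_action_classes=9, lam=1.0):
        super().__init__()
        self.scene_fc = nn.Sequential(nn.Linear(scene_dim, hidden), nn.Tanh())
        self.group_cls = nn.Linear(person_dim + hidden, n_group_classes)
        self.action_cls = nn.Linear(person_dim, n_action_classes)
        self.lam = lam

    def forward(self, person_feats, scene_feat, y_group, y_actions):
        scene = self.scene_fc(scene_feat)                  # (1, hidden) scene embedding
        pooled = person_feats.mean(dim=0, keepdim=True)    # pool members for the group score
        group_score = self.group_cls(torch.cat([pooled, scene], dim=-1))
        action_scores = self.action_cls(person_feats)      # one score vector per person
        loss = F.cross_entropy(group_score, y_group) \
            + self.lam * F.cross_entropy(action_scores, y_actions)
        return group_score, action_scores, loss


head = RecognitionHead()
g, a, loss = head(torch.randn(12, 128), torch.randn(1, 512),
                  torch.tensor([3]), torch.randint(0, 9, (12,)))
print(g.shape, a.shape, loss.item())
```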
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.

Claims (9)

1. The group behavior recognition method based on multi-modal fusion and implicit interactive relation learning is characterized by comprising the following steps:
step A, dynamic and static dual-stream character feature extraction: extracting the static pose features and the dynamic optical flow features of each person based on a character-level feature extraction module;
step B, multi-modal feature fusion: concatenating the unimodal static pose feature and dynamic optical flow feature and compressing them by convolution to obtain a latent vector of the salient information, thereby obtaining a fused feature representation that contains the most representative information of each modality, both optical flow and fine pose;
step C, member interaction relation learning: using the fused feature representation obtained in step B and based on a self-attention mechanism, computing the pairwise appearance similarity of the person features through their association strength, selectively extracting the persons important for behavior recognition, and obtaining an implicit vector representation among the group members computed in the form of an attention-weighted sum;
step D, global feature extraction: based on a global feature extraction module, extracting global feature information containing background information for the input video frames;
and step E, recognizing the group behavior based on step C and step D.
2. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 1, wherein the step B specifically comprises the following steps:
step B1, first concatenating the unimodal features and reducing the number of channels through the encoder convolution network to obtain a self-fused latent vector;
step B2, reconstructing the initially concatenated vector from the self-fused latent vector;
and step B3, minimizing the Euclidean distance between the original and the reconstructed concatenated vectors, and taking the intermediate vector as the fused multi-modal feature representation.
3. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 2, wherein: the step B1 is specifically realized by the following manner:
(1) Embedding the static pose features and the dynamic optical flow features of each person obtained by the character-level feature extraction module into vector spaces of the same dimensionality through an Embedding linear mapping, and taking them respectively as the unimodal inputs;
(2) Given n d-dimensional multi-modal latent vectors $\{h_i\}_{i=1}^{n}$, $h_i \in \mathbb{R}^{d}$, with n less than or equal to 3, where the two modalities denote the static pose feature vector and the dynamic optical flow feature vector of a person, a concatenation operation is first performed to obtain $h_{cat}=[h_1;\dots;h_n]$, where $h_{cat}\in\mathbb{R}^{nd}$;
(3) The encoding part then maps $h_{cat}$ to a latent vector and reduces its dimensionality to t:
in the encoding part, a Linear layer first compresses the concatenated multi-modal vector $h_{cat}$ to the dimension of the unimodal initialization, followed by a nonlinear mapping through a Tanh activation function; a second Linear transformation then further compresses the feature dimension and is followed by a ReLU activation, and the resulting vector $h_{z}\in\mathbb{R}^{t}$ is referred to as the fused latent feature.
4. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 3, wherein the step B2 is specifically realized in the following manner:
the decoding (transformation) part reconstructs the initially concatenated vector from the fused latent feature $h_{z}$, yielding $\hat{h}_{cat}$; the loss $F_{tr}$ between $h_{cat}$ and $\hat{h}_{cat}$ is computed to guide the iterative optimization of the network, so that the learned latent feature representation best captures the salient information of each modality.
5. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 4, wherein: in step B3, an MSE loss function is adopted to guide the learning of the fusion network, and the intermediate vector $h_{z}$ is taken as the fused multi-modal feature representation.
6. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 1, wherein the step C specifically comprises the following stages:
in the first stage, the score of the degree of association between each person and the other participants is calculated by matching the query Q with the key set K; all three representations (Q, K, V) are computed from the input sequence S through linear projections, where S = {s_i | i = 1, …, N} is the group of person features obtained by the character feature extractor after multi-modal adaptive fusion, and A(S) = A(Q(S), K(S), V(S));
in the second stage, the association-degree results obtained from the dot product of the query Q and K are normalized, giving a similarity set a_n, n = 1, 2, …, N over the N persons, whose elements sum to 1;
in the third stage, the similarity vectors obtained by the normalization of the first two stages are multiplied by V to obtain the final weighted attention matrix, which is used for the final classification and recognition.
7. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 1, wherein: in step D, I3D is used as the backbone network with RGB video clips as input, T frames centered on the annotated frame are selected, and the deep spatio-temporal feature map extracted from the final convolutional layer is used as a rich semantic representation describing the whole video clip.
8. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 1, wherein: in step E, the output of step D is fed to a fully connected layer, the output of which is concatenated with the output of the member interaction learning module of step C and passed as input to a fully connected layer with a classification layer, so that the differences in scene-level information within the global field of view assist the person-level features in jointly recognizing the group behavior;
during recognition and classification, two classifiers are set to generate the group behavior category scores and the individual action category scores respectively, the final recognition and classification results of the group behavior and the individual actions are obtained through network learning, and cross-entropy loss is selected to guide the optimization during the whole network training process.
9. The group behavior recognition method based on multi-modal fusion and implicit interactive relationship learning of claim 1, wherein: the input video frames are divided into three branches: a static branch, a dynamic branch and a global branch; the static branch and the dynamic branch are connected to the character-level feature extraction module, and the global branch is connected to the global feature extraction module; in step A, the individual features are enriched by adopting a dual-stream feature extraction scheme, specifically comprising:
(1) The static branch backbone network extracts the static pose features of each person: the bounding box surrounding a person in the video frame is taken as the input of the pose estimation model HRNet, and the key joint points are predicted;
(2) The dynamic branch backbone network extracts dynamic optical flow features: adopting the optical flow representation in the I3D network, the input sequence frames are first converted into consecutive optical flow frames, the stacked frame sequence is then processed by an inflated 3D convolution network, and a RoIAlign layer is added to project coordinates onto the feature map of the frame, so that the optical flow features of each person bounding box in the input frame are extracted.
CN202211365228.1A 2022-11-03 2022-11-03 Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning Pending CN115719510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211365228.1A CN115719510A (en) 2022-11-03 2022-11-03 Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211365228.1A CN115719510A (en) 2022-11-03 2022-11-03 Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning

Publications (1)

Publication Number Publication Date
CN115719510A true CN115719510A (en) 2023-02-28

Family

ID=85254623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211365228.1A Pending CN115719510A (en) 2022-11-03 2022-11-03 Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning

Country Status (1)

Country Link
CN (1) CN115719510A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129333A (en) * 2023-04-14 2023-05-16 北京科技大学 Open set action recognition method based on semantic exploration
CN117649597A (en) * 2024-01-29 2024-03-05 吉林大学 Underwater three-dimensional hand gesture estimation method and system based on event camera
CN117649597B (en) * 2024-01-29 2024-05-14 吉林大学 Underwater three-dimensional hand gesture estimation method and system based on event camera
CN117726821A (en) * 2024-02-05 2024-03-19 武汉理工大学 Medical behavior identification method for region shielding in medical video
CN117726821B (en) * 2024-02-05 2024-05-10 武汉理工大学 Medical behavior identification method for region shielding in medical video

Similar Documents

Publication Publication Date Title
Han et al. A survey on vision transformer
Han et al. A survey on visual transformer
Zhang et al. Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition
CN115719510A (en) Group behavior recognition method based on multi-mode fusion and implicit interactive relation learning
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
CN111444889A (en) Fine-grained action detection method of convolutional neural network based on multi-stage condition influence
Hu et al. Overview of behavior recognition based on deep learning
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
Chen et al. CAAN: Context-aware attention network for visual question answering
Estevam et al. Zero-shot action recognition in videos: A survey
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN111652357A (en) Method and system for solving video question-answer problem by using specific target network based on graph
Zhao et al. Disentangled representation learning and residual GAN for age-invariant face verification
CN112580362A (en) Visual behavior recognition method and system based on text semantic supervision and computer readable medium
Gao et al. A novel multiple-view adversarial learning network for unsupervised domain adaptation action recognition
US11908222B1 (en) Occluded pedestrian re-identification method based on pose estimation and background suppression
Wang et al. A deep clustering via automatic feature embedded learning for human activity recognition
CN114842547A (en) Sign language teaching method, device and system based on gesture action generation and recognition
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN112906520A (en) Gesture coding-based action recognition method and device
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
Wang et al. Pose-based two-stream relational networks for action recognition in videos
Gao et al. Generalized pyramid co-attention with learnable aggregation net for video question answering
CN113240714A (en) Human motion intention prediction method based on context-aware network
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination