CN109446872B

CN109446872B - Group action recognition method based on recurrent neural network

Info

Publication number: CN109446872B
Application number: CN201810971833.0A
Authority: CN
Inventors: 舒祥波; 严锐; 唐金辉
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2018-08-24
Filing date: 2018-08-24
Publication date: 2022-04-19
Anticipated expiration: 2038-08-24
Also published as: CN109446872A

Abstract

The invention provides a population motion identification method, which comprises the following steps: at each moment, extracting the CNN characteristics of the individual as static characteristic representation thereof; modeling the individual dynamics from the static representation of the individual using a long-short term memory model; modeling Long Motion; modeling the interaction dynamics among individuals; modeling Flash Motion.

Description

Group action recognition method based on recurrent neural network

Technical Field

The invention provides a computer vision and multimedia technology, in particular to a group action recognition method based on a recurrent neural network.

Background

Motion recognition, which aims to enable computers to understand the motion occurring in video segments, is receiving increasing attention in the field of computer vision and multimedia. Human activities can be broadly divided into three categories, single person actions, interactive actions and group actions, depending on the number of action participants. Much previous work has focused more on the study of single person motion recognition and made good progress. In addition to single-person actions, real scenes tend to contain more interactive actions (e.g., "handshaking") and group actions (e.g., "queue", "cross-road"). In the interactive scenario, at least two people are interacting at the same time. In the context of group actions, an activity describes a more complex scenario/event involving individual person behavior and various other interactions (e.g., group-person and group-group interactions). In general, group activity recognition is a more challenging task than single person motion recognition and interactive motion recognition.

Disclosure of Invention

The invention aims to provide a group action recognition method based on a recurrent neural network, which comprises the following steps:

step 1, inputting a video segment to be detected, taking a middle T frame, and detecting all moving individuals in each frame;

step 2, extracting the spatial characteristics of all the motion individuals at each moment by using a convolutional neural network;

step 3, establishing a Single-Person LSTM model, and providing individual spatial characteristics to the Single-Person LSTM model to capture individual time dynamic characteristics;

step 4, according to the sequence of the moving time of the individual in the whole activity process, the space-time characteristics of all the individuals are transmitted to Interaction Bi-LSTM to capture context information;

step 5, assigning all hidden states in the Interaction Bi-LSTM with dynamic weights, finally integrating the hidden states into an Aggregation LSTM, and connecting multiple groups of Aggregation states into input of a softmax layer at a corresponding moment;

and 6, averaging the softmax scores at all the moments to serve as a final prediction probability vector of the group activity recognition.

Compared with the prior art, the invention has the following advantages: the invention explores a new concept of 'One to Key' and gradually integrates the space-time characteristics of each Key role in different degrees. The present invention focuses on two types of key roles, one is steady movement throughout the process (long movement time) and the other is violent movement occurring at a certain moment (but closely related to group action). On the basis, a novel participatory time dynamic model (PC-TDM) is provided for identifying group actions, and the novel participatory time dynamic model mainly comprises a 'One' network and a 'One to Key' network. In particular, the goal of "One" networks is to dynamically model individuals. The One to Key network inputs the output of the One network into a bidirectional LSTM (Bi-LSTM) in sequence according to the individual moving duration. Subsequently, each output state of the Bi-LSTM is weighted and aggregated. Experimental results show that the method can remarkably improve the group action recognition performance.

The invention is further described below with reference to the accompanying drawings.

Drawings

Fig. 1 is a framework diagram of the present invention.

FIG. 2 is a capture diagram of Long Motion.

FIG. 3 is a capture diagram of a Flash Motion.

FIG. 4 is a schematic flow chart of the method of the present invention.

Detailed Description

A group action recognition framework based on a recurrent neural Network comprises two sub-networks of 'One' Network (individual space-time Network) and 'One to Key' Network (Key participant time Network).

1. "One" network: individual spatiotemporal networks

Step 1, at each moment, extracting individual CNN (volumetric neural networks) features as static feature representation thereof.

Step 2, modeling the individual dynamics from the static representation of the individual using a long short term memory model (LSTM) (referred to herein as Single-Person LSTM). Formally, with X ═ X₁,x₂,...,x_TIn which x_tIs the spatial CNN feature at time step t extracted from the pre-trained CNN model. Input gating i_tForgetting to gate f_tOutput gate o_tAnd an input modulation gate g_tC of memory cell_tThe single person LSTM is defined as follows,

i_t＝σ(W_ixx_t+W_ihh_t-1+b_i)；

f_t＝σ(W_fxx_t+W_fhh_t-1+b_f)；

o_t＝σ(W_oxx_t+W_ohh_t-1+b_o)；

g_t＝σ(W_gxx_t+W_ghh_t-1+b_g)；

σ () is a sigmoid function; w_*xAnd W_*hIs a weight matrix; b_*Is a bias vector;

representing element-by-element multiplication; and h_tIs a hidden state (containing the time dynamic characteristics of the individual), which contains the dynamic information of the individual at the time t,

is an activation function.

2. "One to Key" network: key participant time network

And 3, modeling the Long Motion. Long Motion is the Motion of a participant with continuous Motion throughout the process. The longer a person moves, the more important is the role he/she plays. To measure the Motion duration of the Long Motion in the entire video clip, the average Motion intensity of each person is measured by superimposing the optical stream images and calculating their average value, as shown in fig. 2. More formally, given a T-frame video clip, each frame has a resolution of w × h, respectively

And

representing the horizontal and vertical displacement vectors at point (u, v) (u 1, 2.., w; v 1, 2.., h). First, to be continuous T frames

And

together, the following components are added:

wherein i 1,2, T, followed by SF^k(u, v, c) (c ═ 1, ·,2T), representing T-frame continuous motion information for the k-th individual at point (u, v). Accordingly, the long motion intensity of the kth individual is defined:

wherein

Represents the action intensity, MI, of the kth individual at time t^kIndicating the kth individual's full motion intensity. Apparently, MI of one person^kThe larger this means that the person is more involved in group activities throughout the process.

And 4, modeling the interaction dynamic state among individuals. Many of the previous works have been modeled in turn by roughly using the spatial location of all people. This ignores the fact that some closely located people are sometimes not relevant. It is clear that a person who is constantly moving (e.g., "moving", "jumping") has a lot of time to interact with others at many times. Therefore, mobile individuals with longer movement times should adopt the participation modeling as early as possible. Formally, according to the MI of each person^kValues, whose features are sorted in descending order and used as the input sequence for the LSTM. Considering that the Interaction between two people is bidirectional, a new Interaction Bi-LSTM is used to model the Interaction sequence, rather than the traditional unidirectional LSTM. At time t, InteractThe ion Bi-LSTM unit calculates the forward feedback sequence

And a backward feedback sequence

K persons were iterated from two directions K → 1 and K → 1 → K, respectively. Output sequence

Can be expressed as:

wherein H () is implemented by the definition of LSTM in step 2),

and

is a weight matrix; b_*Is a bias vector; o denotes a sampling operation. Different from the conventional Bi-LSTM splicing pre-and post-sequence, by pairing

And

sampling in each feature dimension to obtain a final output sequence representation

This can reduce not only redundant information but also computational overhead.

And 5, modeling the Flash Motion. In addition to long movements, some people do not have steady movements throughout their activities, and at some important moment they have strong movements, i.e. Flash Motion. These movements also provide important discriminatory information for identifying group activities. Taking the "left team" activity of a volleyball game as an example, as shown in fig. 3(a), several people participate in the volleyball game in a more intensive flashing motion. Their movements are closely related to the "ball left" activity, providing important information for understanding this activity. Since Flash Motion varies over time, assigning different weighting factors to discover key participants is considered. A straightforward approach is to compute each person's weight from the optical flow values between two consecutive frames. However, some flash movements that occur at important moments may not be related to team activity.

In this invention, an Aggregation LSTM is constructed, the weight factor of each person is learned through the individual action characteristics of her/his, and then the output state of Interaction Bi-LSTM is gradually gathered. If the individual actions are more consistent with the group activity, then the learned weighting factor will be larger, and vice versa. Dividing the overall group activity of K individuals into N_gIdentifying a group, wherein g 1,2_g. Starting index S of group g individuals_gAnd an end index E_gIt can be defined as follows,

S_g＝(g-1)·K/N_g+1；

E_g＝g·K/N_g,

for the kth individual of the g group population in the video segment, a weighting factor is learned

To control the output state of Interaction LSTM at t

To capture the intensity of flash motion:

wherein

W_heIs a weight parameter matrix, b_eIs a bias vector, exp (×) is an exponential function. A potential representation of each person in the g-th group at time t is then obtained

Then, the Aggregation LSTM unit accepts hidden layer data at the previous moment

And characteristic data of current time

Can be simply expressed as follows:

wherein Z_tgIs a characterization of the g-th subgroup at time t. A representation of the entire activity is then given:

finally, it is filled into the Softmax classification layer and averaged for each frame as the final prediction vector for the group activity.

Claims

1. A group action recognition method based on a recurrent neural network is characterized by comprising the following steps:

step 6, averaging softmax scores at all moments to serve as a final prediction probability vector of group activity recognition;

the Single-Person LSTM model in step 3 is

Wherein i is input gating, f is forgotten gating, o is output gate, g is input modulation gate, c is storage unit, W_*xAnd W_*hAs a weight matrix, b_*Is a vector of the offset to the offset,

which means that the multiplication is performed element by element,

is an activation function; h is_tIs a hidden shapeStates, which contain the dynamic characteristics of the individual at time t;

in the step 4, the moving time of the individual in the whole activity process is represented by the whole action intensity of the individual, the stronger the whole action intensity is, the longer the whole action intensity is, and the whole action intensity is obtained through the following processes:

step S401, superposing the horizontal and vertical displacement vectors of each pixel point of the continuous T frames

Wherein, i is 1, 2., T,

and

denotes the horizontal and vertical displacement vectors u, 1,2, a, w, v, 1,2, a, h, respectively, at the point (u, v), the resolution of the image being w × h;

step S402, obtaining T frame continuous motion information SF of the k-th individual at point (u, v)^k(u,v,c)，c＝1,···,2T；

Step S403, obtaining the action intensity and the full-stroke action intensity of the kth individual:

wherein the content of the first and second substances,

represents the action intensity, MI, of the kth individual at time t^kRepresenting the kth individual's full motion intensity;

the specific process of transmitting the space-time characteristics of all individuals to Interaction Bi-LSTM to capture context information in step 4 is as follows:

interaction Bi-LSTM unit calculation forward feedback sequence

And a backward feedback sequence

Respectively iterating K persons from two directions of K → 1 and K → 1 → K, and outputting the sequence

Can be expressed as:

wherein K ═ 1,2, ·, K, H (·) is achieved by the definition of LSTM in step 3,

and

are respectively a weight matrix, b_*Is an offset vector, o denotes a sample operation;

the specific process of the step 5 is as follows:

step S501, constructing an Aggregation LSTM unit, and dividing the whole group activity of K individuals into N_gIdentifying a group, wherein g 1,2_gStarting index S of group g individuals_gAnd an end index E_gIs defined as

S_g＝(g-1)·K/N_g+1

E_g＝g·K/N_g

Step S502, for the kth individual of the g group population in the video segment, learning a weight factor