CN109446872B - Group action recognition method based on recurrent neural network - Google Patents

Group action recognition method based on recurrent neural network Download PDF

Info

Publication number
CN109446872B
CN109446872B CN201810971833.0A CN201810971833A CN109446872A
Authority
CN
China
Prior art keywords
individual
lstm
time
group
individuals
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810971833.0A
Other languages
Chinese (zh)
Other versions
CN109446872A (en)
Inventor
舒祥波
严锐
唐金辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN201810971833.0A priority Critical patent/CN109446872B/en
Publication of CN109446872A publication Critical patent/CN109446872A/en
Application granted granted Critical
Publication of CN109446872B publication Critical patent/CN109446872B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a group action recognition method comprising the following steps: at each moment, extract each individual's CNN features as its static feature representation; model the individual dynamics from the static representations using a long short-term memory (LSTM) model; model Long Motion; model the interaction dynamics among individuals; model Flash Motion.

Description

Group action recognition method based on recurrent neural network
Technical Field
The invention relates to computer vision and multimedia technology, and in particular to a group action recognition method based on a recurrent neural network.
Background
Action recognition, which aims to enable computers to understand the actions occurring in video segments, is receiving increasing attention in the fields of computer vision and multimedia. Depending on the number of participants, human activities can be broadly divided into three categories: single-person actions, interactive actions, and group actions. Much previous work has focused on single-person action recognition and has made good progress. Beyond single-person actions, real scenes often contain interactive actions (e.g., "handshaking") and group actions (e.g., "queuing", "crossing the road"). In an interactive scenario, at least two people interact at the same time. In a group-action scenario, an activity describes a more complex scene/event that involves individual behaviors and various other interactions (e.g., group-person and group-group interactions). In general, group activity recognition is a more challenging task than single-person and interactive action recognition.
Disclosure of Invention
The invention aims to provide a group action recognition method based on a recurrent neural network, comprising the following steps:
Step 1, input the video segment to be analyzed, take T frames from it, and detect all moving individuals in each frame;
Step 2, extract the spatial features of all moving individuals at each moment with a convolutional neural network;
Step 3, build a Single-Person LSTM model and feed the individual spatial features into it to capture each individual's temporal dynamics;
Step 4, feed the spatio-temporal features of all individuals into an Interaction Bi-LSTM, ordered by each individual's movement duration over the whole activity, to capture context information;
Step 5, assign dynamic weights to all hidden states of the Interaction Bi-LSTM, integrate them into an Aggregation LSTM, and concatenate the aggregation states of the groups as the input of the softmax layer at the corresponding moment;
Step 6, average the softmax scores over all moments as the final prediction probability vector for group activity recognition.
Compared with the prior art, the invention has the following advantages. The invention explores a new "One to Key" concept and gradually integrates the spatio-temporal features of each key role to different degrees. The invention focuses on two types of key roles: those who move steadily throughout the process (long movement time), and those whose intense movement occurs at a particular moment but is closely related to the group action. On this basis, a participation-contributed temporal dynamic model (PC-TDM) is proposed for group action recognition, consisting mainly of a "One" network and a "One to Key" network. Specifically, the goal of the "One" network is to model individual dynamics. The "One to Key" network feeds the outputs of the "One" network into a bidirectional LSTM (Bi-LSTM) in order of individual movement duration; each output state of the Bi-LSTM is then weighted and aggregated. Experimental results show that the method significantly improves group action recognition performance.
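The overall flow can be pictured with the short sketch below. It is an illustrative outline only: one_net and one_to_key_net are assumed callables standing in for the two sub-networks detailed later, not components defined by the patent.

```python
# Illustrative sketch (not from the patent): per-frame chaining of the two
# sub-networks, followed by averaging of the per-frame softmax scores.
import torch

def pc_tdm_forward(frame_feats, motion_intensity, one_net, one_to_key_net):
    """frame_feats: list of T tensors, each (K, feat_dim), holding the CNN features
    of the K detected individuals in one frame; motion_intensity: (K,) MI_k values.
    one_net / one_to_key_net are assumed callables for the two sub-networks."""
    scores, state = [], None
    for x_t in frame_feats:
        person_dyn, state = one_net(x_t, state)                        # "One": individual temporal dynamics
        scores.append(one_to_key_net(person_dyn, motion_intensity))    # "One to Key": per-frame group score
    return torch.stack(scores).mean(dim=0)                             # average softmax scores over T frames
```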
The invention is further described below with reference to the accompanying drawings.
Drawings
Fig. 1 is a framework diagram of the present invention.
FIG. 2 is a capture diagram of Long Motion.
FIG. 3 is a capture diagram of a Flash Motion.
FIG. 4 is a schematic flow chart of the method of the present invention.
Detailed Description
A group action recognition framework based on a recurrent neural network comprises two sub-networks: the "One" network (individual spatio-temporal network) and the "One to Key" network (key-participant temporal network).
1. "One" network: individual spatiotemporal networks
Step 1, at each moment, extract the individual's CNN (convolutional neural network) features as its static feature representation.
Step 2, model the individual dynamics from the static representation of the individual using a long short-term memory model (LSTM), referred to here as the Single-Person LSTM. Formally, let $X = \{x_1, x_2, \dots, x_T\}$, where $x_t$ is the spatial CNN feature at time step $t$ extracted from the pre-trained CNN model. With input gate $i_t$, forget gate $f_t$, output gate $o_t$, input modulation gate $g_t$ and memory cell $c_t$, the Single-Person LSTM is defined as follows:

$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$
$g_t = \sigma(W_{gx} x_t + W_{gh} h_{t-1} + b_g)$
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \varphi(c_t)$

where $\sigma(\cdot)$ is the sigmoid function, $\varphi(\cdot)$ is an activation function, $W_{*x}$ and $W_{*h}$ are weight matrices, $b_*$ are bias vectors, $\odot$ denotes element-wise multiplication, and $h_t$ is the hidden state, which contains the individual's temporal dynamics at time $t$.
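As an illustration of how these gate equations can be realized, the following PyTorch sketch computes the four gates, the memory cell and the hidden state for one time step. The choice of tanh for the input modulation gate and for the activation φ, as well as the layer sizes, are assumptions made for the example rather than values fixed by the patent.

```python
# A minimal sketch of the Single-Person LSTM cell written as explicit gate
# equations in PyTorch; input_dim / hidden_dim are illustrative parameters.
import torch
import torch.nn as nn

class SinglePersonLSTMCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # W_*x act on the CNN feature x_t, W_*h on the previous hidden state h_{t-1};
        # the bias vectors b_* are folded into the first linear layer.
        self.W_x = nn.Linear(input_dim, 4 * hidden_dim, bias=True)
        self.W_h = nn.Linear(hidden_dim, 4 * hidden_dim, bias=False)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.W_x(x_t) + self.W_h(h_prev)
        i_t, f_t, o_t, g_t = gates.chunk(4, dim=-1)
        i_t = torch.sigmoid(i_t)            # input gate
        f_t = torch.sigmoid(f_t)            # forget gate
        o_t = torch.sigmoid(o_t)            # output gate
        g_t = torch.tanh(g_t)               # input modulation gate (tanh assumed for phi)
        c_t = f_t * c_prev + i_t * g_t      # memory cell
        h_t = o_t * torch.tanh(c_t)         # hidden state: individual dynamics at time t
        return h_t, c_t
```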
2. "One to Key" network: key participant time network
Step 3, model Long Motion. Long Motion is the motion of a participant who moves continuously throughout the process; the longer a person moves, the more important the role he or she plays. To measure the motion duration of Long Motion over the entire video clip, the average motion intensity of each person is measured by stacking the optical-flow images and computing their average value, as shown in FIG. 2. More formally, given a T-frame video clip in which each frame has resolution $w \times h$, let $d^x_i(u,v)$ and $d^y_i(u,v)$ denote the horizontal and vertical displacement vectors at point $(u,v)$ ($u = 1, 2, \dots, w$; $v = 1, 2, \dots, h$). First, the $d^x_i$ and $d^y_i$ of the T consecutive frames are stacked together:

$SF_k(u,v,2i-1) = d^x_i(u,v)$
$SF_k(u,v,2i) = d^y_i(u,v)$

where $i = 1, 2, \dots, T$, so that $SF_k(u,v,c)$ ($c = 1, \dots, 2T$) represents the continuous motion information of the k-th individual at point $(u,v)$ over the T frames. Accordingly, the long-motion intensity of the k-th individual is defined as:

$MI^t_k = \frac{1}{w \cdot h} \sum_{u=1}^{w} \sum_{v=1}^{h} \big( |SF_k(u,v,2t-1)| + |SF_k(u,v,2t)| \big)$

$MI_k = \frac{1}{T} \sum_{t=1}^{T} MI^t_k$

where $MI^t_k$ represents the action intensity of the k-th individual at time $t$ and $MI_k$ the k-th individual's full-process action intensity. Obviously, the larger a person's $MI_k$, the more that person participates in the group activity throughout the process.
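A minimal NumPy sketch of this motion-intensity measure is shown below. The stacking of $SF_k$ follows the description above; since the exact normalization appears only in the patent figures, the per-pixel and per-frame averaging used here is an assumption.

```python
# Sketch of the Long Motion intensity (assumed normalization): the flow is
# stacked into SF_k and the absolute displacements are averaged per frame.
import numpy as np

def long_motion_intensity(flow_x: np.ndarray, flow_y: np.ndarray):
    """flow_x, flow_y: arrays of shape (T, h, w) with the horizontal and vertical
    optical-flow displacements of one tracked individual.
    Returns the per-frame intensities MI_k^t and the full-process intensity MI_k."""
    T, h, w = flow_x.shape
    # Stack the flow into SF_k(u, v, c), c = 1..2T (odd channels x, even channels y).
    sf = np.empty((h, w, 2 * T), dtype=flow_x.dtype)
    sf[:, :, 0::2] = np.transpose(flow_x, (1, 2, 0))
    sf[:, :, 1::2] = np.transpose(flow_y, (1, 2, 0))
    # Per-frame action intensity: average absolute displacement over all pixels.
    mi_t = (np.abs(flow_x) + np.abs(flow_y)).mean(axis=(1, 2))   # shape (T,)
    mi_full = mi_t.mean()                                        # MI_k
    return mi_t, mi_full
```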
Step 4, model the interaction dynamics among individuals. Many previous works simply model all people in turn according to their spatial locations, which ignores the fact that people who are close together are sometimes unrelated. Clearly, a person who is constantly moving (e.g., "moving", "jumping") interacts with others at many moments. Therefore, individuals with longer movement times should be brought into the modeling as early as possible. Formally, the individuals' features are sorted in descending order of their $MI_k$ values and used as the input sequence of the LSTM. Considering that the interaction between two people is bidirectional, a new Interaction Bi-LSTM is used to model the interaction sequence instead of the conventional unidirectional LSTM. At time $t$, the Interaction Bi-LSTM unit computes the forward feedback sequence $\overrightarrow{h}^t_k$ and the backward feedback sequence $\overleftarrow{h}^t_k$ by iterating over the K persons from the two directions $1 \to K$ and $K \to 1$, respectively. The output sequence $\hat{h}^t_k$ can be expressed as:

$\overrightarrow{h}^t_k = H\big(W_{x\overrightarrow{h}}\, x^t_k + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}^t_{k-1} + b_{\overrightarrow{h}}\big)$
$\overleftarrow{h}^t_k = H\big(W_{x\overleftarrow{h}}\, x^t_k + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}^t_{k+1} + b_{\overleftarrow{h}}\big)$
$\hat{h}^t_k = \overrightarrow{h}^t_k \circ \overleftarrow{h}^t_k$

where $x^t_k$ is the spatio-temporal feature of the k-th ranked individual at time $t$, $H(\cdot)$ is realized by the LSTM definition in step 2, $W_{x\overrightarrow{h}}$, $W_{\overrightarrow{h}\overrightarrow{h}}$, $W_{x\overleftarrow{h}}$ and $W_{\overleftarrow{h}\overleftarrow{h}}$ are weight matrices, $b_*$ are bias vectors, and $\circ$ denotes a sampling operation. Unlike the conventional Bi-LSTM, which concatenates the forward and backward sequences, sampling $\overrightarrow{h}^t_k$ and $\overleftarrow{h}^t_k$ in each feature dimension yields the final output representation $\hat{h}^t_k$, which reduces both redundant information and computational overhead.
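The sketch below illustrates this step with a standard bidirectional LSTM run over the individuals sorted by $MI_k$. The element-wise maximum used to merge the forward and backward states stands in for the sampling operation $\circ$, whose exact rule is not given in the text, and the module and parameter names are illustrative assumptions.

```python
# Sketch of the Interaction Bi-LSTM for one time step: sort individuals by
# motion intensity, run a bidirectional LSTM, merge per feature dimension.
import torch
import torch.nn as nn

class InteractionBiLSTM(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, person_feats, motion_intensity):
        # person_feats: (K, feat_dim) individual features at one time step t
        # motion_intensity: (K,) MI_k values used to order the input sequence
        order = torch.argsort(motion_intensity, descending=True)
        seq = person_feats[order].unsqueeze(0)           # (1, K, feat_dim)
        out, _ = self.bilstm(seq)                        # (1, K, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)
        # "Sampling" merge, approximated here by an element-wise maximum of the
        # forward and backward states (assumption; concatenation is avoided).
        merged = torch.maximum(fwd, bwd).squeeze(0)      # (K, hidden_dim)
        return merged, order
```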
Step 5, model Flash Motion. Besides long motions, some people do not move steadily throughout the activity but move intensely at certain important moments; this is Flash Motion. Such movements also provide important discriminative information for recognizing group activities. Taking the "left team" activity of a volleyball game as an example, as shown in FIG. 3(a), several people take part with relatively intense flash motions. Their movements are closely related to the "left team" activity and provide important information for understanding it. Since Flash Motion varies over time, different weight factors are assigned to discover the key participants. A straightforward approach is to compute each person's weight from the optical-flow values between two consecutive frames; however, some flash motions occurring at important moments may be unrelated to the group activity.
In this invention, an Aggregation LSTM is constructed: the weight factor of each person is learned from his/her individual action features, and the output states of the Interaction Bi-LSTM are then gradually aggregated. If an individual's action is more consistent with the group activity, the learned weight factor is larger, and vice versa. The whole group activity of K individuals is divided into $N_g$ groups for recognition, where $g = 1, 2, \dots, N_g$. The start index $S_g$ and end index $E_g$ of the individuals in group $g$ are defined as:

$S_g = (g-1) \cdot K / N_g + 1$
$E_g = g \cdot K / N_g$

For the k-th individual of the g-th group in the video segment, a weight factor $\alpha^t_{g,k}$ is learned to control the output state $\hat{h}^t_k$ of the Interaction Bi-LSTM at time $t$, so as to capture the intensity of flash motion:

$e^t_{g,k} = W_{he}\, \hat{h}^t_k + b_e$
$\alpha^t_{g,k} = \dfrac{\exp(e^t_{g,k})}{\sum_{j=S_g}^{E_g} \exp(e^t_{g,j})}$

where $W_{he}$ is a weight parameter matrix, $b_e$ is a bias vector, and $\exp(\cdot)$ is the exponential function. The latent representation of each person in the g-th group at time $t$ is then obtained as $z^t_{g,k} = \alpha^t_{g,k}\, \hat{h}^t_k$. The Aggregation LSTM unit then takes the hidden state of the previous step and the feature data $z^t_{g,k}$ of the current step, which can be simply expressed as:

$Z^t_g = \mathrm{AggLSTM}\big(z^t_{g,S_g}, z^t_{g,S_g+1}, \dots, z^t_{g,E_g}\big)$

where $Z^t_g$ is the representation of the g-th subgroup at time $t$. The representation of the entire activity at time $t$ is then given by concatenating the subgroup representations:

$Z^t = \big[Z^t_1, Z^t_2, \dots, Z^t_{N_g}\big]$

Finally, $Z^t$ is fed into the softmax classification layer, and the per-frame softmax scores are averaged as the final prediction vector for the group activity.

Claims (1)

1. A group action recognition method based on a recurrent neural network, characterized by comprising the following steps:
step 1, input the video segment to be analyzed, take T frames from it, and detect all moving individuals in each frame;
step 2, extract the spatial features of all moving individuals at each moment with a convolutional neural network;
step 3, build a Single-Person LSTM model and feed the individual spatial features into it to capture each individual's temporal dynamics;
step 4, feed the spatio-temporal features of all individuals into an Interaction Bi-LSTM, ordered by each individual's movement duration over the whole activity, to capture context information;
step 5, assign dynamic weights to all hidden states of the Interaction Bi-LSTM, integrate them into an Aggregation LSTM, and concatenate the aggregation states of the groups as the input of the softmax layer at the corresponding moment;
step 6, average the softmax scores over all moments as the final prediction probability vector for group activity recognition;
the Single-Person LSTM model in step 3 is
Figure FDA0003292840220000011
Wherein i is input gating, f is forgotten gating, o is output gate, g is input modulation gate, c is storage unit, W*xAnd W*hAs a weight matrix, b*Is a vector of the offset to the offset,
Figure FDA0003292840220000016
which means that the multiplication is performed element by element,
Figure FDA0003292840220000012
is an activation function; h istIs a hidden shapeStates, which contain the dynamic characteristics of the individual at time t;
in step 4, the movement time of an individual over the whole activity is represented by the individual's full-process action intensity (the stronger the full-process action intensity, the longer the movement time), and the full-process action intensity is obtained through the following process:
step S401, stack the horizontal and vertical displacement vectors of each pixel over the T consecutive frames:

$SF_k(u,v,2i-1) = d^x_i(u,v)$
$SF_k(u,v,2i) = d^y_i(u,v)$

wherein $i = 1, 2, \dots, T$, and $d^x_i(u,v)$ and $d^y_i(u,v)$ denote the horizontal and vertical displacement vectors at point $(u,v)$, respectively ($u = 1, 2, \dots, w$; $v = 1, 2, \dots, h$), the resolution of the image being $w \times h$;
step S402, obtain the T-frame continuous motion information $SF_k(u,v,c)$, $c = 1, \dots, 2T$, of the k-th individual at point $(u,v)$;
step S403, obtain the action intensity and the full-process action intensity of the k-th individual:

$MI^t_k = \frac{1}{w \cdot h} \sum_{u=1}^{w} \sum_{v=1}^{h} \big( |SF_k(u,v,2t-1)| + |SF_k(u,v,2t)| \big)$

$MI_k = \frac{1}{T} \sum_{t=1}^{T} MI^t_k$

wherein $MI^t_k$ represents the action intensity of the k-th individual at time $t$ and $MI_k$ represents the k-th individual's full-process action intensity;
the specific process of feeding the spatio-temporal features of all individuals into the Interaction Bi-LSTM to capture context information in step 4 is as follows:
the Interaction Bi-LSTM unit computes the forward feedback sequence $\overrightarrow{h}^t_k$ and the backward feedback sequence $\overleftarrow{h}^t_k$ by iterating over the K persons from the two directions $1 \to K$ and $K \to 1$, respectively, and the output sequence $\hat{h}^t_k$ can be expressed as:

$\overrightarrow{h}^t_k = H\big(W_{x\overrightarrow{h}}\, x^t_k + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}^t_{k-1} + b_{\overrightarrow{h}}\big)$
$\overleftarrow{h}^t_k = H\big(W_{x\overleftarrow{h}}\, x^t_k + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}^t_{k+1} + b_{\overleftarrow{h}}\big)$
$\hat{h}^t_k = \overrightarrow{h}^t_k \circ \overleftarrow{h}^t_k$

wherein $k = 1, 2, \dots, K$, $H(\cdot)$ is realized by the LSTM definition in step 3, $W_{x\overrightarrow{h}}$, $W_{\overrightarrow{h}\overrightarrow{h}}$, $W_{x\overleftarrow{h}}$ and $W_{\overleftarrow{h}\overleftarrow{h}}$ are weight matrices, $b_*$ are bias vectors, and $\circ$ denotes a sampling operation;
the specific process of step 5 is as follows:
step S501, construct an Aggregation LSTM unit and divide the whole group activity of K individuals into $N_g$ groups for recognition, where $g = 1, 2, \dots, N_g$; the start index $S_g$ and end index $E_g$ of the individuals in group $g$ are defined as

$S_g = (g-1) \cdot K / N_g + 1$
$E_g = g \cdot K / N_g$

step S502, for the k-th individual of the g-th group in the video segment, learn a weight factor $\alpha^t_{g,k}$ to control the output state $\hat{h}^t_k$ of the Interaction Bi-LSTM at time $t$, so as to capture the latent representation $z^t_{g,k}$ of each person in the g-th group at time $t$:

$e^t_{g,k} = W_{he}\, \hat{h}^t_k + b_e$
$\alpha^t_{g,k} = \dfrac{\exp(e^t_{g,k})}{\sum_{j=S_g}^{E_g} \exp(e^t_{g,j})}$
$z^t_{g,k} = \alpha^t_{g,k}\, \hat{h}^t_k$

wherein $W_{he}$ is a weight parameter matrix, $b_e$ is a bias vector, and $\exp(\cdot)$ is the exponential function;
step S503, the Aggregation LSTM unit takes the hidden state of the previous step and the feature data $z^t_{g,k}$ of the current step:

$Z^t_g = \mathrm{AggLSTM}\big(z^t_{g,S_g}, z^t_{g,S_g+1}, \dots, z^t_{g,E_g}\big)$

wherein $Z^t_g$ is the feature representation of the g-th subgroup at time $t$;
step S504, obtain the representation of the entire activity at time $t$:

$Z^t = \big[Z^t_1, Z^t_2, \dots, Z^t_{N_g}\big]$
CN201810971833.0A 2018-08-24 2018-08-24 Group action recognition method based on recurrent neural network Active CN109446872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810971833.0A CN109446872B (en) 2018-08-24 2018-08-24 Group action recognition method based on recurrent neural network


Publications (2)

Publication Number Publication Date
CN109446872A CN109446872A (en) 2019-03-08
CN109446872B true CN109446872B (en) 2022-04-19

Family

ID=65530486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810971833.0A Active CN109446872B (en) 2018-08-24 2018-08-24 Group action recognition method based on recurrent neural network

Country Status (1)

Country Link
CN (1) CN109446872B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765956B (en) * 2019-10-28 2021-10-29 西安电子科技大学 Double-person interactive behavior recognition method based on component characteristics


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9977968B2 (en) * 2016-03-04 2018-05-22 Xerox Corporation System and method for relevance estimation in summarization of videos of multi-step activities

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
CN106407889A (en) * 2016-08-26 2017-02-15 上海交通大学 Video human body interaction motion identification method based on optical flow graph depth learning model
CN107179683A (en) * 2017-04-01 2017-09-19 浙江工业大学 Interactive robot intelligent motion detection and control method based on neural network
CN108399435A (en) * 2018-03-21 2018-08-14 南京邮电大学 A kind of video classification methods based on sound feature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huseyin Coskun et al., "Human Motion Analysis with Deep Metric Learning," arXiv:1807.11176v1, 2018-07-30, pp. 1-17. *

Also Published As

Publication number Publication date
CN109446872A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
Xu et al. Predicting head movement in panoramic video: A deep reinforcement learning approach
CN112597883B (en) Human skeleton action recognition method based on generalized graph convolution and reinforcement learning
US10216983B2 (en) Techniques for assessing group level cognitive states
CN112784698B (en) No-reference video quality evaluation method based on deep space-time information
Cai et al. Deep historical long short-term memory network for action recognition
CN111881776B (en) Dynamic expression acquisition method and device, storage medium and electronic equipment
CN111340105A (en) Image classification model training method, image classification device and computing equipment
Yan et al. Predicting human interaction via relative attention model
Vemprala et al. Representation learning for event-based visuomotor policies
CN114783043B (en) Child behavior track positioning method and system
Wang et al. Basketball shooting angle calculation and analysis by deeply-learned vision model
Wasim et al. A novel deep learning based automated academic activities recognition in cyber-physical systems
CN115346262A (en) Method, device and equipment for determining expression driving parameters and storage medium
CN117671787A (en) Rehabilitation action evaluation method based on transducer
Khan et al. Classification of human's activities from gesture recognition in live videos using deep learning
Amara et al. Towards emotion recognition in immersive virtual environments: a method for facial emotion recognition
CN109446872B (en) Group action recognition method based on recurrent neural network
CN110580456A (en) Group activity identification method based on coherent constraint graph long-time memory network
Othman et al. Challenges and Limitations in Human Action Recognition on Unmanned Aerial Vehicles: A Comprehensive Survey.
Zhang et al. Skeleton-based action recognition with attention and temporal graph convolutional network
Usman et al. Skeleton-based motion prediction: A survey
CN115393963A (en) Motion action correcting method, system, storage medium, computer equipment and terminal
CN114898275A (en) Student activity track analysis method
Xiao et al. Gaze prediction based on long short-term memory convolution with associated features of video frames
Manolova et al. Human activity recognition with semantically guided graph-convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant