CN109446872B - Group action recognition method based on recurrent neural network - Google Patents
- Publication number
- CN109446872B (application CN201810971833.0A)
- Authority
- CN
- China
- Prior art keywords
- individual
- lstm
- time
- group
- individuals
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a group action recognition method comprising the following steps: at each moment, extract the CNN features of each individual as its static feature representation; model the individual dynamics from the static representation using a long short-term memory model; model Long Motion; model the interaction dynamics among individuals; model Flash Motion.
Description
Technical Field
The invention relates to computer vision and multimedia technology, and in particular to a group action recognition method based on a recurrent neural network.
Background
Motion recognition, which aims to enable computers to understand the actions occurring in video segments, is receiving increasing attention in the fields of computer vision and multimedia. Depending on the number of participants, human activities can be broadly divided into three categories: single-person actions, interactive actions, and group actions. Much previous work has focused on single-person action recognition and has made good progress. Beyond single-person actions, real scenes tend to contain more interactive actions (e.g., "handshaking") and group actions (e.g., "queuing", "crossing the road"). In an interactive scenario, at least two people interact at the same time. In a group-action scenario, the activity describes a more complex scene or event involving individual behaviors and various other interactions (e.g., group-person and group-group interactions). In general, group activity recognition is a more challenging task than single-person or interactive action recognition.
Disclosure of Invention
The invention aims to provide a group action recognition method based on a recurrent neural network, which comprises the following steps:
step 1, inputting a video segment to be detected, taking the middle T frames, and detecting all moving individuals in each frame;
step 2, extracting the spatial characteristics of all the motion individuals at each moment by using a convolutional neural network;
step 3, establishing a Single-Person LSTM model, and providing individual spatial characteristics to the Single-Person LSTM model to capture individual time dynamic characteristics;
step 4, feeding the spatio-temporal features of all individuals into the Interaction Bi-LSTM, ordered by each individual's movement time over the whole activity, to capture context information;
step 5, assigning dynamic weights to all hidden states of the Interaction Bi-LSTM, aggregating them into the Aggregation LSTM, and concatenating the subgroup aggregation states as the input of the softmax layer at the corresponding moment;
and 6, averaging the softmax scores at all the moments to serve as a final prediction probability vector of the group activity recognition.
Compared with the prior art, the invention has the following advantages. The invention explores a new "One to Key" concept that gradually integrates the spatio-temporal features of each key role to different degrees. The invention focuses on two types of key roles: those who move steadily throughout the process (long movement time) and those whose intense movement occurs at a certain moment but is closely related to the group action. On this basis, a novel participation-contributed temporal dynamic model (PC-TDM) is proposed for recognizing group actions, consisting mainly of a "One" network and a "One to Key" network. Specifically, the goal of the "One" network is to model individual dynamics. The "One to Key" network feeds the outputs of the "One" network into a bidirectional LSTM (Bi-LSTM) in order of each individual's movement duration; each output state of the Bi-LSTM is then weighted and aggregated. Experimental results show that the method significantly improves group action recognition performance.
The invention is further described below with reference to the accompanying drawings.
Drawings
Fig. 1 is a framework diagram of the present invention.
FIG. 2 is a capture diagram of Long Motion.
FIG. 3 is a capture diagram of a Flash Motion.
FIG. 4 is a schematic flow chart of the method of the present invention.
Detailed Description
A group action recognition framework based on a recurrent neural network comprises two sub-networks: the "One" network (individual spatio-temporal network) and the "One to Key" network (key-participant temporal network).
1. "One" network: individual spatiotemporal networks
Step 2, model the individual dynamics from the individual's static representation using a long short-term memory model (LSTM), referred to here as the Single-Person LSTM. Formally, let X = {x_1, x_2, ..., x_T}, where x_t is the spatial CNN feature at time step t extracted from a pre-trained CNN model. With input gate i_t, forget gate f_t, output gate o_t, input modulation gate g_t, and memory cell c_t, the Single-Person LSTM is defined as follows:

i_t = σ(W_ix x_t + W_ih h_{t-1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t-1} + b_f)
o_t = σ(W_ox x_t + W_oh h_{t-1} + b_o)
g_t = φ(W_gx x_t + W_gh h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ φ(c_t)

where σ(·) is the sigmoid function and φ(·) = tanh(·) is the activation function; W_*x and W_*h are weight matrices; b_* are bias vectors; ⊙ denotes element-wise multiplication; and h_t is the hidden state, which contains the temporal dynamic characteristics of the individual at time t.
2. "One to Key" network: key participant time network
Step 3, model Long Motion. Long Motion is the motion of a participant who moves continuously throughout the process; the longer a person moves, the more important the role he/she plays. To measure the movement duration over the entire video clip, the average motion intensity of each person is measured by stacking the optical-flow images and computing their average value, as shown in Fig. 2. More formally, given a T-frame video clip in which each frame has resolution w × h, let d_i^x(u, v) and d_i^y(u, v) denote the horizontal and vertical displacement vectors at point (u, v) (u = 1, 2, ..., w; v = 1, 2, ..., h) of frame i. First, the displacement fields of the T consecutive frames are stacked together:

SF_k(u, v, 2i-1) = d_i^x(u, v)
SF_k(u, v, 2i) = d_i^y(u, v)

where i = 1, 2, ..., T, so that SF_k(u, v, c) (c = 1, ..., 2T) represents the T-frame continuous motion information of the k-th individual at point (u, v). Accordingly, the long-motion intensity of the k-th individual is defined as:

MI_k^t = (1 / (w·h)) Σ_{u=1}^{w} Σ_{v=1}^{h} √( SF_k(u, v, 2t-1)² + SF_k(u, v, 2t)² )

MI_k = (1/T) Σ_{t=1}^{T} MI_k^t

where MI_k^t represents the action intensity of the k-th individual at time t and MI_k represents the k-th individual's whole-clip motion intensity. Clearly, the larger a person's MI_k, the more that person participates in the group activity throughout the process.
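The stacking and intensity computation can be sketched as follows. Since the patent's extraction does not spell out the exact normalization, MI_k is taken here as the mean of the per-frame intensities MI_k^t, and the (T, h, w, 2) flow layout is an assumption:

```python
import numpy as np

def stack_flow(flow):
    """Stack T per-frame flow fields (T, h, w, 2) into SF_k(u, v, c),
    c = 1..2T, interleaving horizontal and vertical displacements."""
    T, h, w, _ = flow.shape
    sf = np.empty((h, w, 2 * T))
    sf[:, :, 0::2] = flow.transpose(1, 2, 0, 3)[..., 0]  # d^x channels
    sf[:, :, 1::2] = flow.transpose(1, 2, 0, 3)[..., 1]  # d^y channels
    return sf

def motion_intensity(sf):
    """Per-frame intensity MI_k^t (mean flow magnitude over the frame)
    and whole-clip intensity MI_k (mean of the per-frame values)."""
    dx, dy = sf[:, :, 0::2], sf[:, :, 1::2]
    mi_t = np.sqrt(dx ** 2 + dy ** 2).mean(axis=(0, 1))  # shape (T,)
    return mi_t, float(mi_t.mean())

# Toy check: a constant (3, 4) displacement has magnitude 5 everywhere.
flow = np.zeros((3, 2, 2, 2))
flow[..., 0], flow[..., 1] = 3.0, 4.0
mi_t, mi = motion_intensity(stack_flow(flow))
```

A person who moves strongly in most frames ends up with a large MI_k, matching the "long movement time" criterion.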
Step 4, model the interaction dynamics among individuals. Much previous work modeled all people in turn, based roughly on their spatial locations. This ignores the fact that closely located people are sometimes unrelated. Clearly, a person who moves persistently (e.g., "moving", "jumping") interacts with others at many moments, so individuals with longer movement times should enter the modeling as early as possible. Formally, the individual features are sorted in descending order of their MI_k values and used as the input sequence of the LSTM. Considering that the interaction between two people is bidirectional, a new Interaction Bi-LSTM, rather than the traditional unidirectional LSTM, is used to model the interaction sequence. The Interaction Bi-LSTM unit computes a forward sequence h_k^f and a backward sequence h_k^b, iterating over the K persons from the two directions k = 1 → K and k = K → 1, respectively. The output sequence h_k can be expressed as:

h_k^f = H(W_{xf} x_k + W_{ff} h_{k-1}^f + b_f)
h_k^b = H(W_{xb} x_k + W_{bb} h_{k+1}^b + b_b)
h_k = h_k^f ∘ h_k^b

where H(·) is implemented by the LSTM definition of step 2; W_{xf}, W_{ff}, W_{xb}, and W_{bb} are weight matrices; b_* are bias vectors; and ∘ denotes a sampling operation. Unlike the conventional Bi-LSTM, which concatenates the forward and backward sequences, the final output representation h_k is obtained by sampling between h_k^f and h_k^b in each feature dimension, which reduces both redundant information and computational overhead.
Step 5, model Flash Motion. Besides long movements, some people do not move steadily throughout the activity but move intensely at some important moment; this is Flash Motion. Such movements also provide important discriminative information for recognizing group activities. Taking the "left team" activity of a volleyball game as an example, as shown in Fig. 3(a), several people participate with intense flash motions. Their movements are closely related to the "left team" activity and provide important information for understanding it. Since Flash Motion varies over time, assigning different weight factors to discover key participants is considered. A straightforward approach is to compute each person's weight from the optical-flow values between two consecutive frames; however, some flash movements occurring at important moments may be unrelated to the group activity.
In this invention, an Aggregation LSTM is constructed: the weight factor of each person is learned from his/her individual action features, and the output states of the Interaction Bi-LSTM are then gradually aggregated. If an individual's action is more consistent with the group activity, the learned weight factor is larger, and vice versa. The overall group activity of K individuals is divided into N_g subgroups, g = 1, 2, ..., N_g. The start index S_g and end index E_g of the individuals of subgroup g are defined as follows:

S_g = (g-1)·K/N_g + 1
E_g = g·K/N_g

For the k-th individual of subgroup g in the video segment, a weight factor α_{g,k}^t is learned to control the output state h_k^t of the Interaction Bi-LSTM at time t and to capture the intensity of flash motion:

α_{g,k}^t = exp(W_e h_k^t + b_e) / Σ_{j=S_g}^{E_g} exp(W_e h_j^t + b_e)

where W_e is a weight parameter matrix, b_e is a bias vector, and exp(·) is the exponential function. The latent representation of each person in the g-th subgroup at time t is then obtained as

z_{g,k}^t = α_{g,k}^t · h_k^t.

The Aggregation LSTM unit then accepts the hidden-layer data of the previous moment Z_g^{t-1} and the feature data of the current moment, which can be expressed simply as:

Z_g^t = LSTM( Z_g^{t-1}, Σ_{k=S_g}^{E_g} z_{g,k}^t )

where Z_g^t is the feature representation of the g-th subgroup at time t. A representation of the entire activity at time t is then given by concatenating the subgroup representations Z_1^t, Z_2^t, ..., Z_{N_g}^t. Finally, this representation is fed into the softmax classification layer, and the per-frame scores are averaged as the final prediction vector of the group activity.
Claims (1)
1. A group action recognition method based on a recurrent neural network is characterized by comprising the following steps:
step 1, inputting a video segment to be detected, taking a middle T frame, and detecting all moving individuals in each frame;
step 2, extracting the spatial characteristics of all the motion individuals at each moment by using a convolutional neural network;
step 3, establishing a Single-Person LSTM model, and providing individual spatial characteristics to the Single-Person LSTM model to capture individual time dynamic characteristics;
step 4, feeding the spatio-temporal features of all individuals into the Interaction Bi-LSTM, ordered by each individual's movement time over the whole activity, to capture context information;
step 5, assigning dynamic weights to all hidden states of the Interaction Bi-LSTM, aggregating them into the Aggregation LSTM, and concatenating the subgroup aggregation states as the input of the softmax layer at the corresponding moment;
step 6, averaging softmax scores at all moments to serve as a final prediction probability vector of group activity recognition;
the Single-Person LSTM model in step 3 is
Wherein i is input gating, f is forgotten gating, o is output gate, g is input modulation gate, c is storage unit, W*xAnd W*hAs a weight matrix, b*Is a vector of the offset to the offset,which means that the multiplication is performed element by element,is an activation function; h istIs a hidden shapeStates, which contain the dynamic characteristics of the individual at time t;
in the step 4, the moving time of the individual in the whole activity process is represented by the whole action intensity of the individual, the stronger the whole action intensity is, the longer the whole action intensity is, and the whole action intensity is obtained through the following processes:
step S401, superpose the horizontal and vertical displacement vectors of each pixel point over the T consecutive frames:

SF_k(u, v, 2i-1) = d_i^x(u, v)
SF_k(u, v, 2i) = d_i^y(u, v)

wherein i = 1, 2, ..., T, and d_i^x(u, v) and d_i^y(u, v) denote the horizontal and vertical displacement vectors at point (u, v) (u = 1, 2, ..., w; v = 1, 2, ..., h), the resolution of the image being w × h;
step S402, obtain the T-frame continuous motion information SF_k(u, v, c), c = 1, ..., 2T, of the k-th individual at point (u, v);
step S403, obtain the action intensity and the whole-clip action intensity of the k-th individual:

MI_k^t = (1 / (w·h)) Σ_{u=1}^{w} Σ_{v=1}^{h} √( SF_k(u, v, 2t-1)² + SF_k(u, v, 2t)² )

MI_k = (1/T) Σ_{t=1}^{T} MI_k^t

wherein MI_k^t represents the action intensity of the k-th individual at time t, and MI_k represents the k-th individual's whole-clip action intensity;
the specific process of feeding the spatio-temporal features of all individuals into the Interaction Bi-LSTM to capture context information in step 4 is as follows:

the Interaction Bi-LSTM unit computes a forward sequence h_k^f and a backward sequence h_k^b, iterating over the K persons from the two directions k = 1 → K and k = K → 1, respectively; the output sequence h_k can be expressed as:

h_k^f = H(W_{xf} x_k + W_{ff} h_{k-1}^f + b_f)
h_k^b = H(W_{xb} x_k + W_{bb} h_{k+1}^b + b_b)
h_k = h_k^f ∘ h_k^b

wherein k = 1, 2, ..., K; H(·) is implemented by the LSTM definition in step 3; W_{xf}, W_{ff}, W_{xb}, and W_{bb} are weight matrices; b_* is a bias vector; and ∘ denotes a sampling operation;
the specific process of step 5 is as follows:

step S501, construct an Aggregation LSTM unit and divide the overall group activity of K individuals into N_g subgroups, g = 1, 2, ..., N_g; the start index S_g and end index E_g of the individuals of subgroup g are defined as

S_g = (g-1)·K/N_g + 1
E_g = g·K/N_g;

step S502, for the k-th individual of subgroup g in the video segment, learn a weight factor α_{g,k}^t to control the output state h_k^t of the Interaction Bi-LSTM at time t and capture the latent representation of each person in the g-th subgroup at time t:

α_{g,k}^t = exp(W_e h_k^t + b_e) / Σ_{j=S_g}^{E_g} exp(W_e h_j^t + b_e)

z_{g,k}^t = α_{g,k}^t · h_k^t;

step S503, the Aggregation LSTM unit accepts the hidden-layer data of the previous moment Z_g^{t-1} and the feature data of the current moment:

Z_g^t = LSTM( Z_g^{t-1}, Σ_{k=S_g}^{E_g} z_{g,k}^t )

wherein Z_g^t is the feature representation of the g-th subgroup at time t;

step S504, obtain a representation of the entire activity at time t by concatenating Z_1^t, Z_2^t, ..., Z_{N_g}^t.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810971833.0A CN109446872B (en) | 2018-08-24 | 2018-08-24 | Group action recognition method based on recurrent neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109446872A CN109446872A (en) | 2019-03-08 |
CN109446872B true CN109446872B (en) | 2022-04-19 |
Family
ID=65530486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810971833.0A Active CN109446872B (en) | 2018-08-24 | 2018-08-24 | Group action recognition method based on recurrent neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109446872B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110765956B (en) * | 2019-10-28 | 2021-10-29 | 西安电子科技大学 | Double-person interactive behavior recognition method based on component characteristics |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101866429A (en) * | 2010-06-01 | 2010-10-20 | 中国科学院计算技术研究所 | Training method of multi-moving object action identification and multi-moving object action identification method |
CN106407889A (en) * | 2016-08-26 | 2017-02-15 | 上海交通大学 | Video human body interaction motion identification method based on optical flow graph depth learning model |
CN107179683A (en) * | 2017-04-01 | 2017-09-19 | 浙江工业大学 | Interactive robot intelligent motion detection and control method based on neural network |
CN108399435A (en) * | 2018-03-21 | 2018-08-14 | 南京邮电大学 | A kind of video classification methods based on sound feature |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9977968B2 (en) * | 2016-03-04 | 2018-05-22 | Xerox Corporation | System and method for relevance estimation in summarization of videos of multi-step activities |
- 2018-08-24: application CN201810971833.0A filed; granted as patent CN109446872B (status: active)
Non-Patent Citations (1)
Title |
---|
Huseyin Coskun et al., "Human Motion Analysis with Deep Metric Learning", arXiv:1807.11176v1, 2018-07-30, pp. 1-17. *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||