CN114821766A - Behavior identification method based on space-time convolution and time sequence feature fusion - Google Patents

Behavior identification method based on space-time convolution and time sequence feature fusion

Info

Publication number
CN114821766A
CN114821766A (application CN202210229686.6A)
Authority
CN
China
Prior art keywords
features
motion
time
convolution
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210229686.6A
Other languages
Chinese (zh)
Inventor
李宏亮
黄俊强
董建伟
盛一航
任子奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210229686.6A
Publication of CN114821766A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention provides a behavior recognition method based on spatio-temporal convolution and time-series feature fusion, addressing the problem of insufficient feature extraction when behavior recognition relies on a single data modality. The method first obtains video stream data and motion data from an inertial sensor, then extracts high-level spatio-temporal semantic features from the video stream through spatio-temporal convolution; in parallel, deep time-series motion features are extracted from the motion data stream. The high-level spatio-temporal semantic features and the deep motion features are fused, the fused features are mapped to output values by a multi-layer perceptron (MLP), and Softmax applied to the output values completes behavior recognition and classification. Because recognition is based on fusing the two kinds of features, the method overcomes the information shortage of mainstream algorithms that rely on a single feature; a self-attention module captures the motion features of key moments, improving the network's accuracy in recognizing abnormal and sudden behaviors.

Description

Behavior identification method based on space-time convolution and time sequence feature fusion
Technical Field
The invention relates to a multi-modal feature-fusion behavior recognition technique and belongs to the field of deep learning.
Background
With the development of electronic and computer technologies, smart wearable devices are becoming steadily more intelligent and practical. Several high-tech companies have released portable smart glasses, such as Apple's and Google's smart glasses. These devices let people record daily life from a first-person perspective and log daily activity data through inertial sensors. Such data has potential value that is hard to estimate: it can be used to improve quality of life and, against the background of global aging, ease the social pressure on elderly people who lack care and companionship. Behavior recognition, a hot topic in artificial intelligence, can record and recognize behaviors and warn of abnormal ones, serving both nursing and emergency-alert purposes.
At present, behavior recognition algorithms based on deep learning and neural networks are widely used. The 3D convolutional neural network based on spatio-temporal convolution is an important branch that extracts features from video, while using a recurrent neural network (RNN) to extract motion features from inertial sensor data is another branch of behavior recognition.
Video data is complex multi-dimensional data with both spatial and temporal dimensions; the high-level abstract information of daily actions is often contained in its temporal structure, and video records both the motion background and the overall action of the human body. Using a spatio-temporal 3D convolutional network, deep semantic information can be extracted from the video stream without losing the temporal information of the motion. A 3D convolutional neural network is usually an expansion of a 2D convolutional network structure such as ResNet or Inception with an added time dimension, which improves its ability to capture the temporal characteristics of motion. Inertial sensor data comprises the angles and accelerations along the mover's three axes, collected by a gyroscope and an accelerometer; it forms continuous T x 6 time-series information that emphasizes the changes in the physical quantities of limb movement. A recurrent neural network (RNN) can extract the temporal change features of an action and thereby recognize different limb movements.
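For concreteness, the following minimal sketch (assuming PyTorch) illustrates the two data forms discussed above: a video clip shaped (batch, channels, frames, height, width) passed through a 3D convolution, and a T x 6 inertial sequence passed through a recurrent layer. The tensor sizes and layer widths are illustrative assumptions, not values taken from the invention.

```python
# Minimal sketch (assumed PyTorch): a 3D convolution over a video clip and an
# LSTM over a T x 6 inertial sequence. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 112, 112)      # (batch, channels, frames, H, W)
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
print(conv3d(video).shape)                    # torch.Size([1, 64, 16, 112, 112])

imu = torch.randn(1, 50, 6)                   # (batch, T, 6): gyroscope + accelerometer
rnn = nn.LSTM(input_size=6, hidden_size=128, batch_first=True)
out, _ = rnn(imu)
print(out.shape)                              # torch.Size([1, 50, 128])
```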
In current mainstream algorithms, behavior recognition is completed by extracting video stream features with a 3D convolutional network alone, and the useless background information contained in a video is an obstacle to extracting action features. Given current graphics-card computing power, a 3D convolutional network can only sample 16 or 32 frames from the thousands of frames in a video, so it cannot cover all moments, and the temporal information of an action cannot be completely extracted from a long video, which increases the difficulty of behavior recognition. A few studies use an RNN to extract action time-series change features from inertial sensor data and complete action recognition, but inertial sensor data contains only the changes in the physical quantities of limb movement without any background information, so actions with similar variation and rhythm are difficult to distinguish well.
Disclosure of Invention
The technical problem the invention aims to solve is insufficient feature extraction when behavior recognition relies on a single data modality. It provides a method that jointly extracts motion features and useful background features from video data and inertial sensor data, using a neural network of hybrid structure to improve the accuracy of behavior recognition.
The technical solution adopted by the invention to solve this problem is as follows: a behavior identification method based on space-time convolution and time-series feature fusion, comprising the following steps:
1) acquiring video stream data and a motion data stream from an inertial sensor;
2) extracting global spatial features of the frame images from the video stream data, sending them to a pooling layer for feature compression, and feeding the compressed global spatial features into a 3D convolutional network to extract high-level spatio-temporal semantic features based on spatio-temporal convolution; meanwhile, feeding the motion data stream into a two-layer bidirectional BiLSTM, combining the hidden-layer features at all time steps to extract limb motion features, inputting the limb motion features into a two-head self-attention mechanism to strengthen the motion information of key moments with weights, and outputting deep time-series motion features after a fully connected feed-forward network and normalization;
3) fusing the high-level spatio-temporal semantic features and the deep motion features to obtain fused features, inputting the fused features into a multi-layer perceptron (MLP) for mapping to obtain output values, and completing behavior recognition and classification by applying Softmax to the output values.
The advantage of the method is that high-level spatio-temporal semantic features and deep motion features are jointly extracted from the video stream data and the inertial sensor stream data by the spatio-temporal convolutional network and the time-series recurrent network; behavior recognition is completed by fusing the two kinds of features, which overcomes the information shortage of mainstream algorithms that rely on a single feature, while the self-attention module captures the motion features of key moments, improving the network's accuracy in recognizing abnormal and sudden behaviors.
Drawings
FIG. 1 is a flow chart of an example embodiment;
FIG. 2 is a schematic diagram of a 3D convolution module;
FIG. 3 is a process diagram of a fusion module.
Detailed Description
The embodiment is implemented mainly on a Linux platform, and network training is completed on a TITAN X graphics card. A behavior recognition dataset mixing video and inertial sensor data needs to be constructed: the invention adopts smart glasses, develops a program that remotely acquires the video data and inertial sensor data over Socket network connections, and finally records a daily-behavior dataset in a head-mounted manner.
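A minimal sketch of the Socket-based remote acquisition idea follows. The glasses-side address, port, and message protocol are hypothetical assumptions; the embodiment states only that a Socket-based program remotely acquires the video and inertial sensor data.

```python
# Minimal sketch of a Socket client that requests one inertial sample from the
# glasses. HOST, PORT and the "GET_IMU" request format are hypothetical.
import socket

HOST, PORT = "192.168.1.50", 9000   # hypothetical address of the smart glasses

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((HOST, PORT))
    s.sendall(b"GET_IMU\n")          # hypothetical request for one T x 6 sample row
    data = s.recv(1024)              # e.g. "gx,gy,gz,ax,ay,az" as bytes
    print(data.decode().strip())
```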
The implementation of behavior recognition mainly comprises 3 steps:
1. Down-sample, crop, and apply data augmentation to the input video to obtain the video stream data; meanwhile, filter, remove outliers from, and normalize the input inertial sensor data to obtain the motion data stream.
2. Extract global spatial features of the frame images from the video stream data through a 3D convolution, send them to a pooling layer for feature compression, and feed the compressed global spatial features into a 3D convolutional network to extract high-level spatio-temporal semantic features; meanwhile, feed the motion data stream into a vertically stacked two-layer bidirectional BiLSTM, combine the motion changes at all time steps, and extract limb motion features from its hidden layers; send these features into a two-head self-attention module to strengthen the motion information of key moments with weights, pass them through a fully connected feed-forward network (FFN), which makes attention training on long sequences more stable, and finally output deep time-series motion features through LayerNorm layer normalization.
3. Fuse the high-level spatio-temporal semantic features and the deep motion features, input the fused features into a multi-layer perceptron (MLP) to complete feature classification, and finally recognize the behavior through Softmax.
The network of this embodiment, shown in FIG. 1, mainly comprises the following parts: a video branch network that extracts deep spatio-temporal semantic features from the video, a motion-sensor branch network that extracts action time-series change features, and a fusion network module that fuses the two branch features; behavior recognition training is finally completed on the combined feature map.
The video branch network is implemented as follows:
the first step is as follows: sampling 32 frames of images from a video through a random frame sampling algorithm, clipping the images to 224x224 size, wherein the clipping method comprises center clipping, random aspect ratio clipping, and then randomly turning the images horizontally, vertically and randomly.
The second step: input the resulting data stream into a 64-channel 1x7x7 convolution to obtain global spatial features, with a convolution stride of 1 in the time dimension and 2x2 in the spatial dimensions; then input the global spatial features into a 1x3x3 max pooling layer for feature compression, yielding the compressed global spatial features.
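A minimal sketch of this stem with the stated kernel, stride, and pooling sizes; the padding values and the BatchNorm/ReLU layers are conventional assumptions not specified in the text.

```python
# Minimal sketch of the stem: 64-channel 1x7x7 convolution with stride (1, 2, 2),
# then 1x3x3 max pooling. Padding, BatchNorm and ReLU are assumptions.
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
)

clip = torch.randn(1, 3, 32, 224, 224)   # (batch, C, T, H, W)
print(stem(clip).shape)                   # torch.Size([1, 64, 32, 56, 56])
```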
The third step: sending the compressed global space features into a 3D convolutional network to obtain high-level space-time semantic features, wherein the 3D convolutional network can be a 3D Resnet or a 3D increment junctionStructure of the organization. In this embodiment, 4 3D residual structure Rsenet group modules are adopted, and one 3D Rsenet group module is a 3D residual structure formed by 1x1x1 convolution, 1x3x3 convolution and 1x1x1 convolution as shown in fig. 2. And taking the high-level space-time semantic features as video path features.
The motion sensor branch network is implemented as follows:
the first step is as follows: and (5) processing the gyroscope and acceleration sensor data of T x 6 by filtering, and sampling noise and abnormal values by the sensor.
The second step: send the filtered motion data stream into a vertically stacked two-layer bidirectional BiLSTM with a hidden-layer feature dimension of 256, and extract the limb motion features from the hidden layer by combining the hidden-layer features at all time steps.
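A minimal sketch of the two-layer bidirectional LSTM with hidden size 256; the per-time-step outputs (forward and backward states concatenated, 512 dimensions) serve as the limb motion features, and the sequence length is illustrative.

```python
# Minimal sketch of the two-layer bidirectional LSTM over the cleaned sensor stream.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=6, hidden_size=256, num_layers=2,
                 batch_first=True, bidirectional=True)

imu = torch.randn(1, 200, 6)          # (batch, T, 6) filtered motion data stream
motion_feats, _ = bilstm(imu)         # (batch, T, 512): 256 per direction
print(motion_feats.shape)
```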
The third step: as shown in fig. 2, the obtained limb movement characteristics are sent to a double-head Self-attention module, the weighted limb movement characteristics at the key moment are scored, and then a fully-connected feedforward network FFN is passed by following a residual error structure. The network is more stable to long sequence attention training, and deep motion characteristics are output through LayerNorm layer normalization. The deep motion features are used as sensor path features.
Wherein the attention weights are computed with the scaled dot-product formula
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V,
where Q, K and V are the query, key and value matrices projected from the limb motion features and d_k is the key dimension.
The calculated weight score matrix strengthens the ability to capture the limb motion features of key moments.
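A minimal sketch of the two-head self-attention block with the residual FFN and final LayerNorm described above, built on the scaled dot-product weights; the feature dimension (512, matching the concatenated bidirectional hidden states) and the FFN width are assumptions.

```python
# Minimal sketch: two-head self-attention with residual connections, an FFN, and
# a final LayerNorm producing the deep motion features.
import torch
import torch.nn as nn

class TwoHeadAttentionBlock(nn.Module):
    def __init__(self, dim=512, ffn_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=2, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, T, dim) BiLSTM features
        a, _ = self.attn(x, x, x)      # scaled dot-product weights emphasize key moments
        x = x + a                      # residual connection around the attention
        x = x + self.ffn(x)            # residual fully connected feed-forward network
        return self.norm(x)            # LayerNorm outputs the deep motion features

feats = torch.randn(1, 200, 512)
print(TwoHeadAttentionBlock()(feats).shape)   # torch.Size([1, 200, 512])
```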
The fusion network module is implemented as shown in FIG. 3: the feature dimensions of the video-path features and the sensor-path features are matched through separate 1x1 convolutions, the two feature maps are fused through an embedding method, the fused features are mapped to output values by a multi-layer perceptron (MLP) network, and the behavior recognition result is finally generated by applying Softmax to the output values.
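A minimal sketch of this fusion head; the channel widths, the global temporal pooling, the concatenation used as the embedding-style fusion, and the class count are all assumptions.

```python
# Minimal sketch of the fusion module: 1x1 convolutions match dimensions, the two
# paths are concatenated, and an MLP with Softmax yields class probabilities.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, video_ch=2048, sensor_ch=512, dim=512, num_classes=10):
        super().__init__()
        self.video_proj = nn.Conv1d(video_ch, dim, kernel_size=1)    # 1x1 conv, video path
        self.sensor_proj = nn.Conv1d(sensor_ch, dim, kernel_size=1)  # 1x1 conv, sensor path
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, v, s):                   # v: (B, video_ch, Tv), s: (B, sensor_ch, Ts)
        v = self.video_proj(v).mean(dim=2)     # global pooling over time -> (B, dim)
        s = self.sensor_proj(s).mean(dim=2)
        logits = self.mlp(torch.cat([v, s], dim=1))
        return torch.softmax(logits, dim=1)    # behavior class probabilities

v = torch.randn(1, 2048, 8)      # video-path features
s = torch.randn(1, 512, 200)     # sensor-path features
print(FusionHead()(v, s).shape)  # torch.Size([1, 10])
```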
In this embodiment, the Adam gradient descent method is adopted for updating and training the network parameters, and the learning rate is first increased and then decreased following a cosine schedule to adjust the parameter update step.
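A minimal sketch of this schedule using Adam with a linear warm-up followed by cosine decay; the warm-up length, base learning rate, and epoch count are assumptions.

```python
# Minimal sketch: Adam optimizer with a learning rate that warms up linearly and
# then follows a cosine decay. Call scheduler.step() once per epoch.
import math
import torch

model = torch.nn.Linear(10, 5)                     # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup, total = 5, 100                             # epochs (assumed values)
def lr_lambda(epoch):
    if epoch < warmup:
        return (epoch + 1) / warmup                # linear warm-up
    t = (epoch - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * t))       # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```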

Claims (5)

1. A behavior identification method based on space-time convolution and time series feature fusion is characterized by comprising the following steps:
1) acquiring video stream data and a motion data stream from an inertial sensor;
2) extracting global spatial features of the frame images from the video stream data, sending them to a pooling layer for feature compression, and feeding the compressed global spatial features into a 3D convolutional network to extract high-level spatio-temporal semantic features based on space-time convolution; meanwhile, feeding the motion data stream into a two-layer bidirectional BiLSTM, combining the hidden-layer features at all time steps to extract limb motion features, inputting the limb motion features into a two-head self-attention mechanism to strengthen the motion information of key moments with weights, and outputting deep time-series motion features after a fully connected feed-forward network and normalization;
3) fusing the high-level spatio-temporal semantic features and the deep motion features to obtain fused features, inputting the fused features into a multi-layer perceptron (MLP) for mapping to obtain output values, and completing behavior recognition and classification by applying Softmax to the output values.
2. The method of claim 1, wherein the video stream data is obtained by down-sampling, cropping, and data augmentation of the input video;
and the motion data stream is obtained by filtering, outlier removal, and normalization of the input inertial sensor data.
3. The method of claim 1, wherein the global spatial features of the frame images are extracted by convolving the video stream data with a 1x7x7 convolution;
and the global spatial features are fed into a 1x3x3 max pooling layer for feature compression.
4. The method of claim 1, wherein the 3D convolutional network takes the form of four sequentially cascaded 3D residual structure group blocks.
5. The method of claim 4, wherein one 3D residual structure group block is a 3D residual structure consisting of 1x1x1 convolution, 1x3x3 convolution and 1x1x1 convolution.
CN202210229686.6A 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion Pending CN114821766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210229686.6A CN114821766A (en) 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210229686.6A CN114821766A (en) 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion

Publications (1)

Publication Number Publication Date
CN114821766A true CN114821766A (en) 2022-07-29

Family

ID=82529387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210229686.6A Pending CN114821766A (en) 2022-03-10 2022-03-10 Behavior identification method based on space-time convolution and time sequence feature fusion

Country Status (1)

Country Link
CN (1) CN114821766A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291707A (en) * 2020-02-24 2020-06-16 南京甄视智能科技有限公司 Abnormal behavior identification method and device, storage medium and server
CN111680660A (en) * 2020-06-17 2020-09-18 郑州大学 Human behavior detection method based on multi-source heterogeneous data stream
CN113627326A (en) * 2021-08-10 2021-11-09 国网福建省电力有限公司营销服务中心 Behavior identification method based on wearable device and human skeleton
CN113691542A (en) * 2021-08-25 2021-11-23 中南林业科技大学 Web attack detection method based on HTTP request text and related equipment
CN113743362A (en) * 2021-09-17 2021-12-03 平安医疗健康管理股份有限公司 Method for correcting training action in real time based on deep learning and related equipment thereof
CN113869189A (en) * 2021-09-24 2021-12-31 华中科技大学 Human behavior recognition method, system, device and medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912645A (en) * 2023-03-29 2024-04-19 安徽医科大学第一附属医院 Blood preservation whole-flow supervision method and system based on Internet of things

Similar Documents

Publication Publication Date Title
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
Reddy et al. Spontaneous facial micro-expression recognition using 3D spatiotemporal convolutional neural networks
Gao et al. Human action monitoring for healthcare based on deep learning
CN111539389B (en) Face anti-counterfeiting recognition method, device, equipment and storage medium
CN112818931A (en) Multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion
CN113780249B (en) Expression recognition model processing method, device, equipment, medium and program product
US20200311962A1 (en) Deep learning based tattoo detection system with optimized data labeling for offline and real-time processing
CN111488805A (en) Video behavior identification method based on saliency feature extraction
CN111797702A (en) Face counterfeit video detection method based on spatial local binary pattern and optical flow gradient
CN116311525A (en) Video behavior recognition method based on cross-modal fusion
CN113627256A (en) Method and system for detecting counterfeit video based on blink synchronization and binocular movement detection
Li et al. Dynamic long short-term memory network for skeleton-based gait recognition
CN113673308A (en) Object identification method, device and electronic system
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network
CN114821766A (en) Behavior identification method based on space-time convolution and time sequence feature fusion
CN113626785B (en) Fingerprint authentication security enhancement method and system based on user fingerprint pressing behavior
CN115471901A (en) Multi-pose face frontization method and system based on generation of confrontation network
CN115731620A (en) Method for detecting counter attack and method for training counter attack detection model
CN113205044B (en) Deep fake video detection method based on characterization contrast prediction learning
Gu et al. Depth MHI based deep learning model for human action recognition
CN115205966A (en) Space-time Transformer action recognition method for sign language recognition
Deshpande et al. Abnormal Activity Recognition with Residual Attention-based ConvLSTM Architecture for Video Surveillance.
CN114360034A (en) Method, system and equipment for detecting deeply forged human face based on triplet network
Veerashetty et al. Texture-based face recognition using grasshopper optimization algorithm and deep convolutional neural network
CN115708135A (en) Face recognition model processing method, face recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination