CN108830252B - Convolutional neural network human body action recognition method fusing global space-time characteristics

Info

Publication number
CN108830252B
Authority
CN
China
Prior art keywords
frame
neural network
static image
convolutional neural
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810671262.9A
Other languages
Chinese (zh)
Other versions
CN108830252A (en)
Inventor
李瑞峰
王珂
程宝平
武军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
Harbin Institute of Technology
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, China Mobile Hangzhou Information Technology Co Ltd filed Critical Harbin Institute of Technology
Priority to CN201810671262.9A priority Critical patent/CN108830252B/en
Publication of CN108830252A publication Critical patent/CN108830252A/en
Application granted granted Critical
Publication of CN108830252B publication Critical patent/CN108830252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A convolutional neural network human body action recognition method fusing global space-time characteristics belongs to the technical field of human body action recognition. The invention solves the problem of low action recognition accuracy in traditional action recognition methods. The method selects the Inception V3 basic network structure and establishes a spatial channel network and a global time domain channel network; the UCF101 video data set is cut into single-frame static images, which are divided into a training set and a test set to train and test the spatial channel network; the energy motion history maps corresponding to the single-frame static images of the training and test sets are calculated to train and test the global time domain channel network. The parameters of the trained spatial channel network and global time domain channel network are then fine-tuned, and the category with the largest average probability is taken as the action recognition result for each frame of static image of the video sequence to be recognized, so that the action recognition accuracy of the method reaches more than 87%. The invention can be applied to the technical field of human body action recognition.

Description

Convolutional neural network human body action recognition method fusing global space-time characteristics
Technical Field
The invention belongs to the technical field of human body action recognition, and particularly relates to a convolutional neural network human body action recognition method fusing global space-time characteristics.
Background
Due to strong demand in fields such as human-computer interaction, intelligent traffic systems and video surveillance, human action recognition has received increasing attention in computer vision. For a computer to recognize actions in different scenes, the core task is to characterize the actions with discriminative features and then classify them. Unlike static image recognition, action recognition involves not only spatial motion features but also the even more important temporal motion features, so how to effectively extract the spatial and temporal motion features of an action are the two main problems that human action recognition must solve.
Traditional action recognition methods focus on manually extracting effective spatiotemporal features and then classifying them with different classifiers. The first step of manual-feature-based action recognition is to extract local features; among the various appearance features, the Histogram of Oriented Gradients (HOG) has been widely studied because of its robustness and efficiency in describing human spatial motion. Inspired by HOG, Laptev et al. combined HOG with optical flow and designed the Histogram of Optical Flow (HOF). HOG has also been extended to HOG-3D to extract spatio-temporal features. Wang and Schmid proposed the Dense Trajectory algorithm (DT), which fuses HOG, HOF and Motion Boundary Histograms (MBH). On this basis, the improved Dense Trajectory algorithm (iDT) was proposed, which mainly introduces background optical flow elimination so that the extracted motion features focus more on the description of human actions. Harris-3D, Hessian-3D and 3D-SIFT are also commonly used local descriptors.
With the great success of CNNs in image classification, attempts have been made to learn motion features automatically from raw images through multiple convolutional and pooling layers. Compared with image classification, actions additionally have temporal motion characteristics, so CNNs used for action recognition are usually more complex. Most CNN-based action recognition methods follow two steps: spatial CNNs are first built from static images and then fused over time, which loses the temporal relationship between motions. Ji et al. therefore designed a 3D-CNN architecture that extracts temporal and spatial features of video data with 3D convolution kernels; these 3D feature extractors operate in both the spatial and temporal dimensions and can capture the motion information of a video stream, but the accuracy of action recognition remains low.
Disclosure of Invention
The invention aims to solve the problem of low accuracy of motion recognition in the traditional motion recognition method.
The technical scheme adopted by the invention for solving the technical problems is as follows:
step one, selecting Inception V3 as a basic network structure, and establishing a spatial channel convolutional neural network;
step two, migrating the parameters of the first 10 layers of the pretrained Inception V3 basic network structure model on the ImageNet data set to the spatial channel convolutional neural network established in the step one; cutting a UCF101 video data set into single-frame static images, randomly dividing the cut single-frame static images into training set data and testing set data, and training and testing a spatial channel convolution neural network;
step three, collecting a video sequence to be recognized, cutting the video sequence to be recognized into each frame of static image to be used as training set and test set data, finely adjusting parameters of the spatial channel convolutional neural network trained in the step two, training and testing the spatial channel convolutional neural network by using each frame of static image of the training set and the test set, and outputting the probability values P1, P2, …, PN of all categories corresponding to each frame of static image of the video sequence to be recognized;
step four, establishing a global time domain channel convolutional neural network, wherein the global time domain channel convolutional neural network only adds one convolutional layer with a 3 × 3 convolution kernel after the input layer of the spatial channel convolutional neural network, and the rest of the network structure is the same as that of the spatial channel convolutional neural network;
fifthly, training the global time domain channel convolution neural network established in the fourth step by utilizing the energy motion historical map corresponding to each frame of static image in the training set in the second step; testing a global time domain channel convolution neural network by utilizing an energy motion historical map corresponding to each frame of static image in the test set in the step two;
step six, after the parameters of the global time domain channel convolutional neural network trained in the step five are finely adjusted, the global time domain channel convolutional neural network is trained and tested by using the energy motion history maps corresponding to each frame of static image of the training set and the test set in the step three, and the probability values P1′, P2′, …, PN′ of each category of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output;
step seven, respectively fusing, for each frame of static image in the video sequence to be recognized, the output of the spatial channel convolutional neural network and the output of the global time domain channel convolutional neural network, namely calculating the probability average value of each category for each frame of static image,

(Pi + Pi′) / 2, i = 1, 2, …, N,

and taking the category with the maximum probability average value as the action recognition result of each frame of static image.
The invention has the beneficial effects that: the invention provides a convolutional neural network human body action recognition method fusing global space-time characteristics, which comprises the steps of establishing a space channel convolutional neural network and a global time domain channel convolutional neural network, and training and testing the established space channel convolutional neural network and the global time domain channel convolutional neural network by utilizing a UCF101 video data set; inputting each frame of static image of a video sequence to be recognized into a trained space channel convolutional neural network, finely adjusting network parameters, training and testing, and outputting probability values of all categories corresponding to each frame of static image of the video sequence to be recognized; sequentially inputting the energy motion historical maps corresponding to each frame of image of the video sequence to be recognized into a trained global time domain channel convolutional neural network for training and testing, and outputting probability values of all categories of the energy motion historical maps corresponding to each frame of static image of the video sequence to be recognized; then, the output results of the spatial channel convolutional neural network and the global time domain channel convolutional neural network are fused to obtain the action recognition result of each frame of static image in the video sequence to be recognized; compared with the traditional action recognition method, the action recognition method has the advantage that the recognition accuracy can be improved to more than 87%.
The invention integrates the space and time characteristics of human body actions and plays a good role in identifying the human body actions.
Drawings
FIG. 1 is a flow chart of a convolutional neural network human body action recognition method with global spatiotemporal features fused according to the present invention;
FIG. 2 is a schematic diagram of spatial channel multiframe fusion according to the present invention;
wherein the fusion of the outputs of 3 frames of static images is shown;
FIG. 3 is a schematic diagram of a global time domain channel input configuration according to the present invention;
wherein: 299 × 299 × 1 is the input layer, and 299 × 299 × 3 is the result after the convolutional layer.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The first embodiment is as follows: this embodiment will be described with reference to fig. 1. The convolutional neural network human body action recognition method fusing global space-time characteristics, which is described in the embodiment, comprises the following specific steps:
step one, selecting Inception V3 as a basic network structure, and establishing a spatial channel convolutional neural network;
step two, migrating the parameters of the first 10 layers of the pretrained Inception V3 basic network structure model on the ImageNet data set to the spatial channel convolutional neural network established in the step one; cutting a UCF101 video data set into single-frame static images, randomly dividing the cut single-frame static images into training set data and testing set data, and training and testing a spatial channel convolution neural network;
step three, collecting a video sequence to be recognized, cutting the video sequence to be recognized into each frame of static image to be used as training set and test set data, finely adjusting parameters of the spatial channel convolutional neural network trained in the step two, training and testing the spatial channel convolutional neural network by using each frame of static image of the training set and the test set, and outputting the probability values P1, P2, …, PN of all categories corresponding to each frame of static image of the video sequence to be recognized;
step four, establishing a global time domain channel convolutional neural network, wherein the global time domain channel convolutional neural network only adds one convolutional layer with a 3 × 3 convolution kernel after the input layer of the spatial channel convolutional neural network, and the rest of the network structure is the same as that of the spatial channel convolutional neural network;
fifthly, training the global time domain channel convolution neural network established in the fourth step by utilizing the energy motion historical map corresponding to each frame of static image in the training set in the second step; testing a global time domain channel convolution neural network by utilizing an energy motion historical map corresponding to each frame of static image in the test set in the step two;
step six, after the parameters of the global time domain channel convolutional neural network trained in the step five are finely adjusted, the global time domain channel convolutional neural network is trained and tested by using the energy motion history maps corresponding to each frame of static image of the training set and the test set in the step three, and the probability values P1′, P2′, …, PN′ of each category of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output;
step seven, respectively fusing, for each frame of static image in the video sequence to be recognized, the output of the spatial channel convolutional neural network and the output of the global time domain channel convolutional neural network, namely calculating the probability average value of each category for each frame of static image,

(Pi + Pi′) / 2, i = 1, 2, …, N,

and taking the category with the maximum probability average value as the action recognition result of each frame of static image.
The convolutional neural network fusing the global space-time characteristics can better extract the space-time information of the action.
The second embodiment is as follows: the embodiment further defines the convolutional neural network human body action recognition method fusing the global space-time characteristics, which is described in the first embodiment, and the specific process of the first step in the embodiment is as follows:
selecting Inception V3 as the basic network structure, removing the last fully connected layer of the basic network structure, and sequentially adding, from front to back, a fully connected layer with 1024 neurons, a fully connected layer with 256 neurons, and a fully connected layer whose number of neurons equals the number N of action categories.
In the present embodiment, the activation function of the fully connected layers with 1024 and 256 neurons is relu, and the activation function of the fully connected layer whose number of neurons equals the number of action categories is softmax.
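For illustration, the spatial channel network of this embodiment can be sketched with the Keras functional API as follows. This is a minimal, hedged sketch: it assumes that a global average pooling layer bridges the Inception V3 convolutional base and the newly added fully connected layers (the text does not name the bridging layer), and num_classes stands for the number N of action categories.

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

def build_spatial_channel(num_classes):
    # Inception V3 base with its original top (fully connected) layer removed.
    base = InceptionV3(weights="imagenet", include_top=False,
                       input_shape=(299, 299, 3))
    x = GlobalAveragePooling2D()(base.output)               # assumed bridging layer (not named in the text)
    x = Dense(1024, activation="relu")(x)                   # added fully connected layer, 1024 neurons
    x = Dense(256, activation="relu")(x)                    # added fully connected layer, 256 neurons
    outputs = Dense(num_classes, activation="softmax")(x)   # fully connected layer with N neurons
    return Model(inputs=base.input, outputs=outputs)

spatial_net = build_spatial_channel(num_classes=101)        # e.g. the 101 UCF101 action categories
```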
The third concrete implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing the global space-time characteristics, which is described in the second embodiment, and the specific process of the second step in the embodiment is as follows:
migrating the parameters of the first 10 layers of the Inception V3 basic network structure model pre-trained on the ImageNet data set, namely the parameters from the 1st convolutional layer of the model up to the 3rd Inception module, to the spatial channel convolutional neural network established in the first step; cutting the UCF101 video data set into standard-input single-frame static images with a size of 299 × 299; randomly dividing the cut single-frame static images into training set and test set data; sequentially inputting the static images of the training set into the spatial channel convolutional neural network and training with the Adam gradient descent method, with the mini-batch size set to 32 and the Keras default parameters used for the remaining settings; and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations.
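A hedged sketch of this training regime, assuming the spatial_net model built in the previous sketch, that the transferred layers are frozen during training (the text only states that their parameters are migrated), and that train_images/train_labels and test_images/test_labels are hypothetical arrays holding the 299 × 299 single-frame crops and one-hot labels of the UCF101 training and test splits:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Approximation of "from the 1st convolution layer up to the 3rd Inception module":
# freeze the first layers of the Keras model (the exact layer index is an assumption).
NUM_TRANSFERRED_LAYERS = 10
for layer in spatial_net.layers[:NUM_TRANSFERRED_LAYERS]:
    layer.trainable = False

spatial_net.compile(optimizer="adam",                  # Adam with Keras default parameters
                    loss="categorical_crossentropy",
                    metrics=["accuracy"])

# Stop when the test-set accuracy has not improved for 10 consecutive epochs.
early_stop = EarlyStopping(monitor="val_accuracy", patience=10,
                           restore_best_weights=True)

spatial_net.fit(train_images, train_labels,
                batch_size=32,                         # mini-batch size of 32
                epochs=200,                            # upper bound; early stopping ends training
                validation_data=(test_images, test_labels),
                callbacks=[early_stop])
```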
The fourth concrete implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing global space-time characteristics, in the third step of the embodiment, a fall action data set is collected as a video sequence to be recognized, the video sequence to be recognized comprises actions of falling, walking and sitting, each action comprises M video sequences, the M video sequences are randomly divided into a training set and a test set, and each video sequence is cut into K frames of static images;
finely adjusting parameters of the spatial channel convolutional neural network, namely modifying the output class of the last layer of the spatial channel convolutional neural network to be 3;
sequentially inputting the static images of the training set into the spatial channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations;
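A sketch of this two-stage fine-tuning, assuming the spatial_net model from the earlier sketches with its output layer already replaced by a 3-class softmax, and that fall_train_x/fall_train_y and fall_test_x/fall_test_y are hypothetical arrays holding the single-frame images and labels of the fall data set:

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD

early_stop = EarlyStopping(monitor="val_accuracy", patience=10,
                           restore_best_weights=True)

# Stage 1: train only the last fully connected (3-class softmax) layer with Adam.
for layer in spatial_net.layers:
    layer.trainable = False
spatial_net.layers[-1].trainable = True
spatial_net.compile(optimizer="adam", loss="categorical_crossentropy",
                    metrics=["accuracy"])
spatial_net.fit(fall_train_x, fall_train_y, epochs=10,
                validation_data=(fall_test_x, fall_test_y))

# Stage 2: also unfreeze the second-to-last fully connected layer and continue with SGD.
spatial_net.layers[-2].trainable = True
spatial_net.compile(optimizer=SGD(learning_rate=0.0001, momentum=0.9),
                    loss="categorical_crossentropy", metrics=["accuracy"])
spatial_net.fit(fall_train_x, fall_train_y, epochs=200,
                validation_data=(fall_test_x, fall_test_y),
                callbacks=[early_stop])   # same 10-evaluation patience as above
```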
performing action recognition in the spatial channel convolutional neural network in a multi-frame fusion mode, i.e. averaging the output for the input current frame of static image with the outputs for the previous frames of static images, and outputting the probability values P1, P2 and P3 of the 3 categories corresponding to each frame of static image of the video sequence to be recognized.
In the present embodiment, the multi-frame fusion scheme is: if the input static image of the current frame is the n-th frame and the probabilities of the three categories in its individual output result are Pn^1, Pn^2 and Pn^3, then taking the average means that this individual output result of the current frame is averaged with the output results of the previous n−1 frames.
Because an action is a three-dimensional space-time signal, a large error may occur if the spatial channel takes only the output of the current frame as the basis for the decision. This embodiment therefore recognizes actions in the spatial channel in a multi-frame fusion mode, averaging the recognition results of the current frame and a fixed number of previous frames. As shown in fig. 2, the outputs of the current frame and the previous 2 frames are fused; although the current frame alone is recognized incorrectly, the correct result is finally output through the correction of the previous 2 frames, so the recognition accuracy is improved.
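A small sketch of this multi-frame fusion rule, under the assumptions that the per-frame class probabilities are already available and that a 3-frame window is used as in fig. 2 (the window size and the example numbers are illustrative, not taken from the patent):

```python
import numpy as np

def fuse_frames(prob_history, window=3):
    """prob_history: list of per-frame class probability vectors, newest last.
    The probabilities inside the window are averaged before the decision is taken."""
    window_probs = np.stack(prob_history[-window:])    # current frame + previous frames
    fused = window_probs.mean(axis=0)                  # element-wise average
    return int(np.argmax(fused)), fused

# The current frame alone would be classified as category 1,
# but the two preceding frames correct the decision to category 0.
history = [np.array([0.7, 0.2, 0.1]),   # frame t-2
           np.array([0.6, 0.3, 0.1]),   # frame t-1
           np.array([0.2, 0.5, 0.3])]   # frame t (misclassified on its own)
print(fuse_frames(history))
```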
The fifth concrete implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing the global space-time characteristics, which is described in the fourth embodiment, and the specific process of the fifth step in the embodiment is as follows:
sequentially inputting the energy motion historical maps of the single-frame static images of the training set in the second step into the established global time domain channel convolutional neural network, training the global time domain channel convolutional neural network by adopting an Adam gradient descent method, setting the mini-batch size to be 32, adopting Keras default parameters as parameters, and stopping training if the action recognition accuracy of the test set is not increased for at least 10 times continuously;
the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the t-th frame of static image is Hτ(x, y, t), which is obtained according to the update function:

Hτ(x, y, t) = τ, if ψ(x, y, t) = 1
Hτ(x, y, t) = max(0, Hτ(x, y, t−1) − δ), otherwise

in the formula: (x, y) is the position of the pixel point in the energy motion history map corresponding to the t-th frame of static image; max represents taking the larger of 0 and Hτ(x, y, t−1) − δ; Hτ(x, y, t−1) is the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the (t−1)-th frame of static image; τ is the duration and δ is the decay parameter;
ψ(x, y, t) is the update function, which judges whether each pixel point belongs to the foreground of the current frame; if so, ψ(x, y, t) = 1, otherwise ψ(x, y, t) = 0;
ψ(x, y, t) is obtained by the inter-frame difference method:

ψ(x, y, t) = 1, if D(x, y, t) ≥ ξ
ψ(x, y, t) = 0, otherwise

D(x, y, t) = |I(x, y, t) − I(x, y, te)|

in the formula: I(x, y, t) is the gray value of the pixel point at coordinates (x, y) in the t-th frame of static image; I(x, y, te) is the gray value of the pixel point at coordinates (x, y) in the previous effective frame of static image; ξ is the threshold used to discriminate between foreground and background; D(x, y, t) is the absolute value of the difference between I(x, y, t) and I(x, y, te);
the process of calculating the energy motion history map comprises the following steps:
if the static image of the current frame is an effective frame, updating the energy motion historical map once, otherwise, not updating;
the judgment principle of the effective frame is as follows: setting a first frame static image as an effective frame, and if the motion energy of the current frame static image relative to the previous effective frame static image is greater than a threshold value mu, setting the current frame as the effective frame;
define Et as the motion energy of the t-th frame of static image It relative to the previous effective frame of static image Ite:

Et = (1/C) Σ(x=1..w) Σ(y=1..h) dt(x, y)

dt(x, y) = √( dt^x(x, y)² + dt^y(x, y)² )

wherein: C is the number of pixel points of the t-th frame of static image that have a displacement relative to the previous effective frame of static image; w and h are the width and height of the t-th frame of static image, respectively; dt(x, y) is the displacement of the pixel point (x, y) in the t-th frame of static image relative to the previous effective frame of static image; dt^x(x, y) is the displacement of the pixel point (x, y) between the t-th frame of static image and the previous effective frame of static image in the horizontal direction, and dt^y(x, y) is the displacement in the vertical direction;
the global dense optical flow is computed as:

[dt^x, dt^y] = CalcOpticalFlowFarneback(Ite, It)

in the formula: dt^x and dt^y are the optical flow between the t-th frame of static image and the previous effective frame of static image in the horizontal and vertical directions; CalcOpticalFlowFarneback is the optical flow function.
The EMHI is a vision-based template that represents human actions in the form of image gray values by accumulating the pixel changes at the same position over a period of time. Considering that many actions span a large number of frames, if the EMHI were updated with every frame, the earlier motion information would fade away, so an update method based on effective frames is proposed.
In essence, the displacement of the pixels is used to decide whether a frame is an effective frame, but simply summing the displacements of all pixels in the image is not feasible: because viewing angles differ, the proportion of the moving person in the image differs, and a person close to the camera can obtain a large motion energy with only a small movement. The influence of the viewing angle is therefore eliminated by dividing by the number of effective (moving) pixels.
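A hedged Python/OpenCV sketch of one EMHI update step as described above; the threshold values ξ (foreground), μ (effective frame), the duration τ, the decay δ and the small displacement cut-off used to count moving pixels are assumed for illustration and are not specified in the patent:

```python
import cv2
import numpy as np

XI, MU, TAU, DELTA = 30, 1.0, 255, 15      # assumed parameter values
MIN_DISP = 0.5                             # assumed cut-off for "pixels with displacement"

def update_emhi(emhi, prev_valid, frame_gray):
    """One EMHI update step. emhi is float32, the frames are uint8 grayscale.
    Returns the (possibly unchanged) EMHI and the new previous effective frame."""
    # Global dense optical flow between the previous effective frame and the current frame.
    flow = cv2.calcOpticalFlowFarneback(prev_valid, frame_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    d_t = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)   # displacement magnitude d_t(x, y)
    moving = d_t > MIN_DISP
    C = max(int(moving.sum()), 1)                          # number of moving pixels
    E_t = d_t.sum() / C                                    # motion energy E_t

    if E_t <= MU:                  # not an effective frame: EMHI is left unchanged
        return emhi, prev_valid

    # Frame difference D(x, y, t) and foreground mask psi(x, y, t).
    D = cv2.absdiff(frame_gray, prev_valid)
    psi = D >= XI
    # EMHI update: set foreground pixels to tau, decay the rest by delta (floor at 0).
    emhi = np.where(psi, TAU, np.maximum(emhi - DELTA, 0)).astype(np.float32)
    return emhi, frame_gray        # the current frame becomes the previous effective frame
```

In use, the first frame of the sequence is taken as the initial effective frame, as stated in the embodiment, and the EMHI can be initialised to an all-zero array (an assumption for this sketch).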
The sixth specific implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing global space-time characteristics described in the fifth embodiment. In the sixth step of the embodiment, the parameters of the global time domain channel convolutional neural network are fine-tuned according to the fall action data set collected in the third step, that is, the output category number of the last layer of the global time domain channel convolutional neural network is modified to 3;
sequentially inputting the energy motion history maps corresponding to each frame of static image in the training set into the global time domain channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the energy motion history maps of the test set does not increase for at least 10 consecutive evaluations; the probability values P1′, P2′ and P3′ of the 3 categories of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output.
The seventh embodiment: the embodiment further defines the convolutional neural network human body action recognition method fusing global space-time characteristics, and calculates the probability average value of each category for each frame of static image of the fall action data set,

(P1 + P1′) / 2, (P2 + P2′) / 2 and (P3 + P3′) / 2,

and takes the category with the maximum probability average value as the action recognition result of each frame of static image.
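As an illustration of this fusion step, a minimal sketch (the example probability values are made up for demonstration):

```python
import numpy as np

def fuse_two_channels(p_spatial, p_temporal):
    """Average the spatial-channel and global time-domain-channel class probabilities
    and return the index of the category with the largest mean probability."""
    p_mean = (np.asarray(p_spatial, dtype=float) + np.asarray(p_temporal, dtype=float)) / 2.0
    return int(np.argmax(p_mean)), p_mean

# Example: P1, P2, P3 from the spatial channel and P1', P2', P3' from the temporal channel.
label, p_mean = fuse_two_channels([0.6, 0.3, 0.1], [0.5, 0.2, 0.3])
print(label, p_mean)    # 0 [0.55 0.25 0.2]
```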
Examples
The UCF101 database is selected to evaluate the recognition effect; it contains 13320 videos of 101 actions with complex action scenes. The trained network is then transferred to the small sample data set used in this project.
The invention designs two-channel CNNs: the basic network structures of the spatial channel convolutional neural network and the global time domain channel convolutional neural network both adopt the Inception V3 basic network structure; the input of the spatial channel convolutional neural network is a single-frame static image, and the input of the global time domain channel convolutional neural network is the energy motion history map (EMHI) of the single-frame image. The two channels are trained independently, and finally the output results of the two channels are fused to recognize the human action.
The UCF101 spatial channel is trained until a high recognition rate is reached and then transferred to the small sample data set for fine-tuning; in the test set, 30 consecutive frames of each video sequence are selected for evaluation.
The test results are shown in Table 1. On the UCF101 data set, the spatial channel recognition accuracy is 70.2%, and the multi-frame fusion mode improves it to 70.9%, 71.3% and 71.5%, respectively. The method performs better on the small sample data set: the spatial channel recognition accuracy is 73.4%, and the multi-frame fusion mode improves it to 74.7%, 74.9% and 75.1%, respectively. The small sample data set has only 3 action types, far fewer than the UCF101 data set, so the error is smaller. The multi-frame fusion method can improve recognition accuracy and reduce error, which proves its effectiveness.
TABLE 1 average recognition rate of spatial channels
The MHI and the EMHI are calculated from the video data set respectively to serve as training data sets for the global time domain channel; the global time domain channel is trained on UCF101 until a high recognition rate is reached and then transferred to the small sample data set for fine-tuning, and the recognition effects of the MHI and the EMHI are compared with the same test method as the spatial channel. Since the input of the global time domain channel is a single-channel grey-scale map while the rest of the channel (identical in structure to the spatial channel) expects a three-channel RGB map, as shown in FIG. 3 the invention adds one more convolutional layer after the input layer; the number of convolution kernels is 3 and the boundary is zero-padded, so that the structure of the original input layer is satisfied.
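A sketch of this input adaptation, assuming the Keras functional API and that the rest of the channel reuses the same structure as the spatial channel network sketched earlier (build_spatial_channel is the hypothetical helper from that sketch):

```python
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

emhi_input = Input(shape=(299, 299, 1))                    # single-channel EMHI, 299 x 299 x 1
rgb_like = Conv2D(3, (3, 3), padding="same")(emhi_input)   # 3 kernels, 3 x 3, zero padding -> 299 x 299 x 3
backbone = build_spatial_channel(num_classes=101)          # same structure as the spatial channel
temporal_net = Model(inputs=emhi_input, outputs=backbone(rgb_like))
```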
As shown in Table 2, on the UCF101 data set the action recognition accuracy using the MHI is 75.8%, and the action recognition rate of the EMHI is 78.3%. On the small sample data set the MHI action recognition accuracy is 78.4% and the EMHI action recognition rate is 80.2%. In general, the action recognition accuracy of the EMHI is higher than that of the MHI, which verifies the effectiveness of the EMHI for action recognition.
TABLE 2 Global time Domain channel average identification Rate
The recognition results of the spatial channel convolutional network and the global time domain channel convolutional network are then fused, with the same test method as above. As shown in Table 3, the average recognition rate on the UCF101 data set is 85.2%, and the average recognition rate on the small sample data set is 87.2%. It can be seen that the deep feature learning capabilities of the spatial channel and the global time domain channel complement each other.
TABLE 3 two-channel average recognition rate
The invention provides a two-channel convolutional neural network human action recognition framework based on spatial and global time domain characteristics, which extracts deep features of human action information well. The spatial channel is recognized in a multi-frame fusion mode, and the experimental results show that this method effectively improves the recognition accuracy of the spatial channel; for the global time domain channel, the motion-energy-based EMHI with adaptive capability proposed by the invention extracts global temporal action features more effectively than the traditional MHI. The two channels are combined by average fusion for recognition, and the experimental results show that the two channels complement each other, which improves the accuracy of action recognition. In addition, the proposed method is pre-trained with a large action data set, and its transfer to a small sample data set shows good recognition accuracy, which verifies the effectiveness of the method.

Claims (7)

1. A convolutional neural network human body action recognition method fused with global space-time characteristics is characterized by comprising the following specific steps:
step one, selecting Inception V3 as a basic network structure, and establishing a spatial channel convolutional neural network;
step two, migrating the parameters of the first 10 layers of the pretrained Inception V3 basic network structure model on the ImageNet data set to the spatial channel convolutional neural network established in the step one; cutting a UCF101 video data set into single-frame static images, randomly dividing the cut single-frame static images into training set data and testing set data, and training and testing a spatial channel convolution neural network;
step three, collecting a video sequence to be recognized, cutting the video sequence to be recognized into each frame of static image to be used as training set and test set data, finely adjusting parameters of the spatial channel convolutional neural network trained in the step two, training and testing the spatial channel convolutional neural network by using each frame of static image of the training set and the test set, and outputting the probability values P1, P2, …, PN of each category corresponding to each frame of static image of the video sequence to be recognized;
step four, establishing a global time domain channel convolutional neural network, wherein the global time domain channel convolutional neural network only adds one convolutional layer with a 3 × 3 convolution kernel after the input layer of the spatial channel convolutional neural network, and the rest of the network structure is the same as that of the spatial channel convolutional neural network;
fifthly, training the global time domain channel convolution neural network established in the fourth step by utilizing the energy motion historical map corresponding to each frame of static image in the training set in the second step; testing a global time domain channel convolution neural network by utilizing an energy motion historical map corresponding to each frame of static image in the test set in the step two;
step six, after the parameters of the global time domain channel convolutional neural network trained in the step five are finely adjusted, the global time domain channel convolutional neural network is trained and tested by using the energy motion history maps corresponding to each frame of static image of the training set and the test set in the step three, and the probability values P1′, P2′, …, PN′ of each category of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output;
step seven, respectively fusing, for each frame of static image in the video sequence to be recognized, the output of the spatial channel convolutional neural network and the output of the global time domain channel convolutional neural network, namely calculating the probability average value of each category for each frame of static image,

(Pi + Pi′) / 2, i = 1, 2, …, N,

and taking the category with the maximum probability average value as the action recognition result of each frame of static image.
2. The method for recognizing the human body action of the convolutional neural network fused with the global spatiotemporal features as claimed in claim 1, wherein the specific process of the step one is as follows:
selecting Inception V3 as the basic network structure, removing the last fully connected layer of the basic network structure, and sequentially adding, from front to back, a fully connected layer with 1024 neurons, a fully connected layer with 256 neurons, and a fully connected layer whose number of neurons equals the number N of action categories.
3. The method for recognizing the human body action of the convolutional neural network fused with the global spatiotemporal features as claimed in claim 2, wherein the specific process of the second step is as follows:
migrating the parameters of the first 10 layers of the Inception V3 basic network structure model pre-trained on the ImageNet data set, namely the parameters from the 1st convolutional layer of the model up to the 3rd Inception module, to the spatial channel convolutional neural network established in the first step; cutting the UCF101 video data set into standard-input single-frame static images with a size of 299 × 299; randomly dividing the cut single-frame static images into training set and test set data; sequentially inputting the static images of the training set into the spatial channel convolutional neural network and training with the Adam gradient descent method, with the mini-batch size set to 32 and the Keras default parameters used for the remaining settings; and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations.
4. The method for recognizing human body actions through the convolutional neural network fused with global spatio-temporal features as claimed in claim 3, wherein in the third step, a data set of falling actions is collected as a video sequence to be recognized, the video sequence to be recognized comprises actions of falling, walking and sitting, each action comprises M video sequences, the M video sequences are randomly divided into a training set and a testing set, and each video sequence is cut into K frames of static images;
finely adjusting parameters of the spatial channel convolutional neural network, namely modifying the output class of the last layer of the spatial channel convolutional neural network to be 3;
sequentially inputting the static images of the training set into the spatial channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations;
performing action recognition in the spatial channel convolutional neural network in a multi-frame fusion mode, i.e. averaging the output for the input current frame of static image with the outputs for the previous frames of static images, and outputting the probability values P1, P2 and P3 of the 3 categories corresponding to each frame of static image of the video sequence to be recognized.
5. The method for identifying human body actions by using convolutional neural network fused with global spatiotemporal features as claimed in claim 4, wherein the specific process of the fifth step is as follows:
sequentially inputting the energy motion historical maps of the single-frame static images of the training set in the second step into the established global time domain channel convolutional neural network, training the global time domain channel convolutional neural network by adopting an Adam gradient descent method, setting the mini-batch size to be 32, adopting Keras default parameters as parameters, and stopping training if the action recognition accuracy of the test set is not increased for at least 10 times continuously;
the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the t-th frame of static image is Hτ(x, y, t), which is obtained according to the update function:

Hτ(x, y, t) = τ, if ψ(x, y, t) = 1
Hτ(x, y, t) = max(0, Hτ(x, y, t−1) − δ), otherwise

in the formula: (x, y) is the position of the pixel point in the energy motion history map corresponding to the t-th frame of static image; max represents taking the larger of 0 and Hτ(x, y, t−1) − δ; Hτ(x, y, t−1) is the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the (t−1)-th frame of static image; τ is the duration and δ is the decay parameter;
ψ(x, y, t) is the update function, which judges whether each pixel point belongs to the foreground of the current frame; if so, ψ(x, y, t) = 1, otherwise ψ(x, y, t) = 0;
ψ(x, y, t) is obtained by the inter-frame difference method:

ψ(x, y, t) = 1, if D(x, y, t) ≥ ξ
ψ(x, y, t) = 0, otherwise

D(x, y, t) = |I(x, y, t) − I(x, y, te)|

in the formula: I(x, y, t) is the gray value of the pixel point at coordinates (x, y) in the t-th frame of static image; I(x, y, te) is the gray value of the pixel point at coordinates (x, y) in the previous effective frame of static image; ξ is the threshold used to discriminate between foreground and background; D(x, y, t) is the absolute value of the difference between I(x, y, t) and I(x, y, te);
the process of calculating the energy motion history map comprises the following steps:
if the static image of the current frame is an effective frame, updating the energy motion historical map once, otherwise, not updating;
the judgment principle of the effective frame is as follows: setting a first frame static image as an effective frame, and if the motion energy of the current frame static image relative to the previous effective frame static image is greater than a threshold value mu, setting the current frame as the effective frame;
define Et as the motion energy of the t-th frame of static image It relative to the previous effective frame of static image Ite:

Et = (1/C) Σ(x=1..w) Σ(y=1..h) dt(x, y)

dt(x, y) = √( dt^x(x, y)² + dt^y(x, y)² )

wherein: C is the number of pixel points of the t-th frame of static image that have a displacement relative to the previous effective frame of static image; w and h are the width and height of the t-th frame of static image, respectively; dt(x, y) is the displacement of the pixel point (x, y) in the t-th frame of static image relative to the previous effective frame of static image; dt^x(x, y) is the displacement of the pixel point (x, y) between the t-th frame of static image and the previous effective frame of static image in the horizontal direction, and dt^y(x, y) is the displacement in the vertical direction;
the global dense optical flow is computed as:

[dt^x, dt^y] = CalcOpticalFlowFarneback(Ite, It)

in the formula: dt^x and dt^y are the optical flow between the t-th frame of static image and the previous effective frame of static image in the horizontal and vertical directions; CalcOpticalFlowFarneback is the optical flow function.
6. The method for recognizing the human body actions of the convolutional neural network fused with the global space-time characteristics as claimed in claim 5, wherein in the sixth step, the parameters of the global time domain channel convolutional neural network are finely adjusted according to the fall action data sets collected in the third step, that is, the output category of the last layer of the global time domain channel convolutional neural network is modified to be 3;
sequentially inputting the energy motion history maps corresponding to each frame of static image in the training set into the global time domain channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the energy motion history maps of the test set does not increase for at least 10 consecutive evaluations; and outputting the probability values P1′, P2′ and P3′ of the 3 categories of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized.
7. The method of claim 6, wherein the probability average value of each category for each frame of static image of the fall action data set is calculated,

(P1 + P1′) / 2, (P2 + P2′) / 2 and (P3 + P3′) / 2,

and the category with the maximum probability average value is taken as the action recognition result of each frame of static image.
CN201810671262.9A 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics Active CN108830252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810671262.9A CN108830252B (en) 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810671262.9A CN108830252B (en) 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics

Publications (2)

Publication Number Publication Date
CN108830252A CN108830252A (en) 2018-11-16
CN108830252B true CN108830252B (en) 2021-09-10

Family

ID=64137766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810671262.9A Active CN108830252B (en) 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics

Country Status (1)

Country Link
CN (1) CN108830252B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508684B (en) * 2018-11-21 2022-12-27 中山大学 Method for recognizing human behavior in video
CN111261190A (en) * 2018-12-03 2020-06-09 北京嘀嘀无限科技发展有限公司 Method, system, computer device and storage medium for recognizing sound
CN109522874B (en) * 2018-12-11 2020-08-21 中国科学院深圳先进技术研究院 Human body action recognition method and device, terminal equipment and storage medium
CN109726672B (en) * 2018-12-27 2020-08-04 哈尔滨工业大学 Tumbling detection method based on human body skeleton sequence and convolutional neural network
CN109685037B (en) * 2019-01-08 2021-03-05 北京汉王智远科技有限公司 Real-time action recognition method and device and electronic equipment
CN110068302A (en) * 2019-03-07 2019-07-30 中科院微电子研究所昆山分所 A kind of vehicle odometry method based on deep neural network
CN109886358B (en) * 2019-03-21 2022-03-08 上海理工大学 Human behavior recognition method based on multi-time-space information fusion convolutional neural network
CN111832351A (en) * 2019-04-18 2020-10-27 杭州海康威视数字技术股份有限公司 Event detection method and device and computer equipment
CN110110624B (en) * 2019-04-24 2023-04-07 江南大学 Human body behavior recognition method based on DenseNet and frame difference method characteristic input
CN110110812B (en) * 2019-05-20 2022-08-19 江西理工大学 Stream depth network model construction method for video motion recognition
CN110334589B (en) * 2019-05-23 2021-05-14 中国地质大学(武汉) High-time-sequence 3D neural network action identification method based on hole convolution
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature polymerization coding and shot and long term memory network
CN110334607B (en) * 2019-06-12 2022-03-04 武汉大学 Video human interaction behavior identification method and system
CN110414367B (en) * 2019-07-04 2022-03-29 华中科技大学 Time sequence behavior detection method based on GAN and SSN
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110532431B (en) * 2019-07-23 2023-04-18 平安科技(深圳)有限公司 Short video keyword extraction method and device and storage medium
CN110610145B (en) * 2019-08-28 2022-11-08 电子科技大学 Behavior identification method combined with global motion parameters
CN110580681B (en) * 2019-09-12 2020-11-24 杭州海睿博研科技有限公司 High-resolution cardiac motion pattern analysis device and method
CN110705412A (en) * 2019-09-24 2020-01-17 北京工商大学 Video target detection method based on motion history image
CN111046821B (en) * 2019-12-19 2023-06-20 东北师范大学人文学院 Video behavior recognition method and system and electronic equipment
CN111353394B (en) * 2020-02-20 2023-05-23 中山大学 Video behavior recognition method based on three-dimensional alternate update network
CN111401507B (en) * 2020-03-12 2021-01-26 大同公元三九八智慧养老服务有限公司 Adaptive decision tree fall detection method and system
CN113468913B (en) * 2020-03-30 2022-07-05 阿里巴巴集团控股有限公司 Data processing method, motion recognition method, model training method, device and storage medium
CN111507252A (en) * 2020-04-16 2020-08-07 上海眼控科技股份有限公司 Human body falling detection device and method, electronic terminal and storage medium
CN111582231A (en) * 2020-05-21 2020-08-25 河海大学常州校区 Fall detection alarm system and method based on video monitoring
CN111866449B (en) * 2020-06-17 2022-03-29 中国人民解放军国防科技大学 Intelligent video acquisition system and method
CN112115788A (en) * 2020-08-14 2020-12-22 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112115846B (en) * 2020-09-15 2024-03-01 上海迥灵信息技术有限公司 Method and device for identifying random garbage behavior and readable storage medium
CN112381118B (en) * 2020-10-23 2024-05-17 百色学院 College dance examination evaluation method and device
CN113158723B (en) * 2020-12-25 2022-06-07 神思电子技术股份有限公司 End-to-end video motion detection positioning system
CN112766062B (en) * 2020-12-30 2022-08-05 河海大学 Human behavior identification method based on double-current deep neural network
CN112766176B (en) * 2021-01-21 2023-12-01 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN112818914B (en) * 2021-02-24 2023-08-18 网易(杭州)网络有限公司 Video content classification method and device
CN113343786B (en) * 2021-05-20 2022-05-17 武汉大学 Lightweight video action recognition method and system based on deep learning
CN114360209B (en) * 2022-01-17 2023-06-23 常州信息职业技术学院 Video behavior recognition security system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732208A (en) * 2015-03-16 2015-06-24 电子科技大学 Video human action reorganization method based on sparse subspace clustering
CN106778474A (en) * 2016-11-14 2017-05-31 深圳奥比中光科技有限公司 3D human body recognition methods and equipment
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639042B2 (en) * 2010-06-22 2014-01-28 Microsoft Corporation Hierarchical filtered motion field for action recognition
US20150339871A1 (en) * 2014-05-15 2015-11-26 Altitude Co. Entity management and recognition system and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732208A (en) * 2015-03-16 2015-06-24 电子科技大学 Video human action reorganization method based on sparse subspace clustering
CN106778474A (en) * 2016-11-14 2017-05-31 深圳奥比中光科技有限公司 3D human body recognition methods and equipment
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Two-stream convolutional networks for action recognition in videos; Simonyan K et al.; arXiv; 2018-06-01; full text *

Also Published As

Publication number Publication date
CN108830252A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN109740419B (en) Attention-LSTM network-based video behavior identification method
CN109670446B (en) Abnormal behavior detection method based on linear dynamic system and deep network
WO2020173226A1 (en) Spatial-temporal behavior detection method
WO2021098261A1 (en) Target detection method and apparatus
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN109190479A (en) A kind of video sequence expression recognition method based on interacting depth study
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
CN105069434B (en) A kind of human action Activity recognition method in video
Chaudhari et al. Face detection using viola jones algorithm and neural networks
CN109685037B (en) Real-time action recognition method and device and electronic equipment
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
CN108960047B (en) Face duplication removing method in video monitoring based on depth secondary tree
WO2011028380A2 (en) Foreground object detection in a video surveillance system
WO2011028379A2 (en) Foreground object tracking
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN113963445A (en) Pedestrian falling action recognition method and device based on attitude estimation
CN112906631B (en) Dangerous driving behavior detection method and detection system based on video
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
Mehta et al. Motion and region aware adversarial learning for fall detection with thermal imaging
CN112926522B (en) Behavior recognition method based on skeleton gesture and space-time diagram convolution network
CN112070010B (en) Pedestrian re-recognition method for enhancing local feature learning by combining multiple-loss dynamic training strategies
JP2017162409A (en) Recognizing device, and method, for facial expressions and motions
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant