CN108830252B - Convolutional neural network human body action recognition method fusing global space-time characteristics

Info

Publication number
CN108830252B
Authority
CN
China
Prior art keywords
frame
neural network
static image
convolutional neural
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810671262.9A
Other languages
Chinese (zh)
Other versions
CN108830252A (en)
Inventor
李瑞峰
王珂
程宝平
武军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
Harbin Institute of Technology
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, China Mobile Hangzhou Information Technology Co Ltd filed Critical Harbin Institute of Technology
Priority to CN201810671262.9A priority Critical patent/CN108830252B/en
Publication of CN108830252A publication Critical patent/CN108830252A/en
Application granted granted Critical
Publication of CN108830252B publication Critical patent/CN108830252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A convolutional neural network human body action recognition method fusing global space-time characteristics belongs to the technical field of human body action recognition. The invention solves the problem of low action recognition accuracy in traditional action recognition methods. The method selects the Inception V3 basic network structure and establishes a spatial channel network and a global time domain channel network; the UCF101 video data set is cut into single-frame static images, which are divided into a training set and a test set to train and test the spatial channel network; the energy motion history maps corresponding to the single-frame static images of the training and test sets are calculated to train and test the global time domain channel network. The parameters of the trained spatial channel network and global time domain channel network are then fine-tuned, and the category with the largest average probability is taken as the action recognition result for each frame of static image of the video sequence to be recognized, so that the action recognition accuracy of the method reaches more than 87%. The invention can be applied to the technical field of human body action recognition.

Description

Convolutional neural network human body action recognition method fusing global space-time characteristics
Technical Field
The invention belongs to the technical field of human body action recognition, and particularly relates to a convolutional neural network human body action recognition method fusing global space-time characteristics.
Background
Due to strong demand in fields such as human-computer interaction, intelligent traffic systems and video surveillance, human action recognition has received increasing attention in computer vision. For a computer to recognize actions in different scenes, the core task is to characterize the actions with discriminative features and then classify them. Unlike static image recognition, action recognition involves not only spatial motion features but also the even more important temporal motion features, so how to effectively extract the spatial and temporal motion features of an action are the two main problems that human action recognition must solve.
Traditional action recognition methods focus on manually extracting effective spatiotemporal features and then classifying them with different classifiers. The first step of manual-feature-based action recognition is to extract local features; among the various appearance features, the Histogram of Oriented Gradients (HOG) has been widely studied because of its robustness and efficiency in describing human spatial motion. Inspired by HOG, Laptev et al. combined HOG with optical flow and designed the Histogram of Optical Flow (HOF). HOG has also been extended to HOG-3D to extract spatio-temporal features. Wang and Schmid proposed the Dense Trajectory algorithm (DT), which fuses HOG, HOF and Motion Boundary Histograms (MBH). On this basis, the improved Dense Trajectory algorithm (iDT) was proposed, which mainly introduces background optical flow elimination so that the extracted motion features focus more on the description of human actions. Harris-3D, Hessian-3D and 3D-SIFT are also commonly used local descriptors.
With the great success of CNNs in image classification, attempts have been made to learn motion features automatically from raw images through multiple convolutional and pooling layers. Compared with image classification, actions additionally have temporal motion characteristics, so CNNs used for action recognition are usually more complex. Most CNN-based action recognition methods follow two steps: spatial CNNs are first built from static images and then fused over time, which loses the temporal relationship between motions. Ji et al. therefore designed a 3D-CNN architecture that extracts temporal and spatial features of video data with 3D convolution kernels; these 3D feature extractors operate in both the spatial and temporal dimensions and can capture the motion information of a video stream, but the accuracy of action recognition remains low.
Disclosure of Invention
The invention aims to solve the problem of low accuracy of motion recognition in the traditional motion recognition method.
The technical scheme adopted by the invention for solving the technical problems is as follows:
step one, selecting Inception V3 as a basic network structure, and establishing a spatial channel convolutional neural network;
step two, migrating the parameters of the first 10 layers of the pretrained Inception V3 basic network structure model on the ImageNet data set to the spatial channel convolutional neural network established in the step one; cutting a UCF101 video data set into single-frame static images, randomly dividing the cut single-frame static images into training set data and testing set data, and training and testing a spatial channel convolution neural network;
step three, collecting a video sequence to be recognized, cutting the video sequence to be recognized into each frame of static image to be used as training set and test set data, finely adjusting parameters of the spatial channel convolutional neural network trained in the step two, training and testing the spatial channel convolutional neural network by using each frame of static image of the training set and the test set, and outputting the probability values P1, P2, …, PN of all categories corresponding to each frame of static image of the video sequence to be recognized;
step four, establishing a global time domain channel convolutional neural network, wherein the global time domain channel convolutional neural network only adds one convolutional layer with a 3 × 3 convolution kernel after the input layer of the spatial channel convolutional neural network, and the rest of the network structure is the same as that of the spatial channel convolutional neural network;
fifthly, training the global time domain channel convolution neural network established in the fourth step by utilizing the energy motion historical map corresponding to each frame of static image in the training set in the second step; testing a global time domain channel convolution neural network by utilizing an energy motion historical map corresponding to each frame of static image in the test set in the step two;
step six, after the parameters of the global time domain channel convolutional neural network trained in the step five are finely adjusted, the global time domain channel convolutional neural network is trained and tested by using the energy motion history maps corresponding to each frame of static image of the training set and the test set in the step three, and the probability values P1′, P2′, …, PN′ of each category of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output;
step seven, respectively fusing, for each frame of static image in the video sequence to be recognized, the output of the spatial channel convolutional neural network and the output of the global time domain channel convolutional neural network, namely calculating the probability average value of each category for each frame of static image,

(Pi + Pi′) / 2, i = 1, 2, …, N,

and taking the category with the maximum probability average value as the action recognition result of each frame of static image.
The invention has the beneficial effects that: the invention provides a convolutional neural network human body action recognition method fusing global space-time characteristics, which comprises the steps of establishing a space channel convolutional neural network and a global time domain channel convolutional neural network, and training and testing the established space channel convolutional neural network and the global time domain channel convolutional neural network by utilizing a UCF101 video data set; inputting each frame of static image of a video sequence to be recognized into a trained space channel convolutional neural network, finely adjusting network parameters, training and testing, and outputting probability values of all categories corresponding to each frame of static image of the video sequence to be recognized; sequentially inputting the energy motion historical maps corresponding to each frame of image of the video sequence to be recognized into a trained global time domain channel convolutional neural network for training and testing, and outputting probability values of all categories of the energy motion historical maps corresponding to each frame of static image of the video sequence to be recognized; then, the output results of the spatial channel convolutional neural network and the global time domain channel convolutional neural network are fused to obtain the action recognition result of each frame of static image in the video sequence to be recognized; compared with the traditional action recognition method, the action recognition method has the advantage that the recognition accuracy can be improved to more than 87%.
The invention integrates the space and time characteristics of human body actions and plays a good role in identifying the human body actions.
Drawings
FIG. 1 is a flow chart of a convolutional neural network human body action recognition method with global spatiotemporal features fused according to the present invention;
FIG. 2 is a schematic diagram of spatial channel multiframe fusion according to the present invention;
wherein the fusion of the outputs of 3 frames of static images is shown;
FIG. 3 is a schematic diagram of a global time domain channel input configuration according to the present invention;
wherein: 299 × 299 × 1 is the input layer, and 299 × 299 × 3 is the result after the convolutional layer.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The first embodiment is as follows: this embodiment will be described with reference to fig. 1. The convolutional neural network human body action recognition method fusing global space-time characteristics, which is described in the embodiment, comprises the following specific steps:
step one, selecting Inception V3 as a basic network structure, and establishing a spatial channel convolutional neural network;
step two, migrating the parameters of the first 10 layers of the pretrained Inception V3 basic network structure model on the ImageNet data set to the spatial channel convolutional neural network established in the step one; cutting a UCF101 video data set into single-frame static images, randomly dividing the cut single-frame static images into training set data and testing set data, and training and testing a spatial channel convolution neural network;
step three, collecting a video sequence to be recognized, cutting the video sequence to be recognized into each frame of static image to be used as training set and test set data, finely adjusting parameters of the spatial channel convolutional neural network trained in the step two, training and testing the spatial channel convolutional neural network by using each frame of static image of the training set and the test set, and outputting the probability values P1, P2, …, PN of all categories corresponding to each frame of static image of the video sequence to be recognized;
step four, establishing a global time domain channel convolutional neural network, wherein the global time domain channel convolutional neural network only adds one convolutional layer with a 3 × 3 convolution kernel after the input layer of the spatial channel convolutional neural network, and the rest of the network structure is the same as that of the spatial channel convolutional neural network;
fifthly, training the global time domain channel convolution neural network established in the fourth step by utilizing the energy motion historical map corresponding to each frame of static image in the training set in the second step; testing a global time domain channel convolution neural network by utilizing an energy motion historical map corresponding to each frame of static image in the test set in the step two;
step six, after the parameters of the global time domain channel convolutional neural network trained in the step five are finely adjusted, the global time domain channel convolutional neural network is trained and tested by using the energy motion history maps corresponding to each frame of static image of the training set and the test set in the step three, and the probability values P1′, P2′, …, PN′ of each category of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output;
step seven, respectively fusing, for each frame of static image in the video sequence to be recognized, the output of the spatial channel convolutional neural network and the output of the global time domain channel convolutional neural network, namely calculating the probability average value of each category for each frame of static image,

(Pi + Pi′) / 2, i = 1, 2, …, N,

and taking the category with the maximum probability average value as the action recognition result of each frame of static image.
The convolutional neural network fusing the global space-time characteristics can better extract the space-time information of the action.
The second embodiment is as follows: the embodiment further defines the convolutional neural network human body action recognition method fusing the global space-time characteristics, which is described in the first embodiment, and the specific process of the first step in the embodiment is as follows:
selecting Inception V3 as the basic network structure, removing the last fully connected layer of the basic network structure, and sequentially adding, from front to back, a fully connected layer with 1024 neurons, a fully connected layer with 256 neurons, and a fully connected layer whose number of neurons equals the number N of action categories.
In the present embodiment, the activation function of the fully connected layers with 1024 and 256 neurons is relu, and the activation function of the fully connected layer whose number of neurons equals the number of action categories is softmax.
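For illustration, the spatial channel network of this embodiment can be sketched with the Keras functional API as follows. This is a minimal, hedged sketch: it assumes that a global average pooling layer bridges the Inception V3 convolutional base and the newly added fully connected layers (the text does not name the bridging layer), and num_classes stands for the number N of action categories.

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

def build_spatial_channel(num_classes):
    # Inception V3 base with its original top (fully connected) layer removed.
    base = InceptionV3(weights="imagenet", include_top=False,
                       input_shape=(299, 299, 3))
    x = GlobalAveragePooling2D()(base.output)               # assumed bridging layer (not named in the text)
    x = Dense(1024, activation="relu")(x)                   # added fully connected layer, 1024 neurons
    x = Dense(256, activation="relu")(x)                    # added fully connected layer, 256 neurons
    outputs = Dense(num_classes, activation="softmax")(x)   # fully connected layer with N neurons
    return Model(inputs=base.input, outputs=outputs)

spatial_net = build_spatial_channel(num_classes=101)        # e.g. the 101 UCF101 action categories
```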
The third concrete implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing the global space-time characteristics, which is described in the second embodiment, and the specific process of the second step in the embodiment is as follows:
migrating the parameters of the first 10 layers of the Inception V3 basic network structure model pre-trained on the ImageNet data set, namely the parameters from the 1st convolutional layer of the model up to the 3rd Inception module, to the spatial channel convolutional neural network established in the first step; cutting the UCF101 video data set into standard-input single-frame static images with a size of 299 × 299; randomly dividing the cut single-frame static images into training set and test set data; sequentially inputting the static images of the training set into the spatial channel convolutional neural network and training with the Adam gradient descent method, with the mini-batch size set to 32 and the Keras default parameters used for the remaining settings; and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations.
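A hedged sketch of this training regime, assuming the spatial_net model built in the previous sketch, that the transferred layers are frozen during training (the text only states that their parameters are migrated), and that train_images/train_labels and test_images/test_labels are hypothetical arrays holding the 299 × 299 single-frame crops and one-hot labels of the UCF101 training and test splits:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Approximation of "from the 1st convolution layer up to the 3rd Inception module":
# freeze the first layers of the Keras model (the exact layer index is an assumption).
NUM_TRANSFERRED_LAYERS = 10
for layer in spatial_net.layers[:NUM_TRANSFERRED_LAYERS]:
    layer.trainable = False

spatial_net.compile(optimizer="adam",                  # Adam with Keras default parameters
                    loss="categorical_crossentropy",
                    metrics=["accuracy"])

# Stop when the test-set accuracy has not improved for 10 consecutive epochs.
early_stop = EarlyStopping(monitor="val_accuracy", patience=10,
                           restore_best_weights=True)

spatial_net.fit(train_images, train_labels,
                batch_size=32,                         # mini-batch size of 32
                epochs=200,                            # upper bound; early stopping ends training
                validation_data=(test_images, test_labels),
                callbacks=[early_stop])
```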
The fourth concrete implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing global space-time characteristics, in the third step of the embodiment, a fall action data set is collected as a video sequence to be recognized, the video sequence to be recognized comprises actions of falling, walking and sitting, each action comprises M video sequences, the M video sequences are randomly divided into a training set and a test set, and each video sequence is cut into K frames of static images;
finely adjusting parameters of the spatial channel convolutional neural network, namely modifying the output class of the last layer of the spatial channel convolutional neural network to be 3;
sequentially inputting the static images of the training set into the spatial channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations;
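A sketch of this two-stage fine-tuning, assuming the spatial_net model from the earlier sketches with its output layer already replaced by a 3-class softmax, and that fall_train_x/fall_train_y and fall_test_x/fall_test_y are hypothetical arrays holding the single-frame images and labels of the fall data set:

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD

early_stop = EarlyStopping(monitor="val_accuracy", patience=10,
                           restore_best_weights=True)

# Stage 1: train only the last fully connected (3-class softmax) layer with Adam.
for layer in spatial_net.layers:
    layer.trainable = False
spatial_net.layers[-1].trainable = True
spatial_net.compile(optimizer="adam", loss="categorical_crossentropy",
                    metrics=["accuracy"])
spatial_net.fit(fall_train_x, fall_train_y, epochs=10,
                validation_data=(fall_test_x, fall_test_y))

# Stage 2: also unfreeze the second-to-last fully connected layer and continue with SGD.
spatial_net.layers[-2].trainable = True
spatial_net.compile(optimizer=SGD(learning_rate=0.0001, momentum=0.9),
                    loss="categorical_crossentropy", metrics=["accuracy"])
spatial_net.fit(fall_train_x, fall_train_y, epochs=200,
                validation_data=(fall_test_x, fall_test_y),
                callbacks=[early_stop])   # same 10-evaluation patience as above
```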
performing action recognition in the spatial channel convolutional neural network in a multi-frame fusion mode, i.e. averaging the output for the input current frame of static image with the outputs for the previous frames of static images, and outputting the probability values P1, P2 and P3 of the 3 categories corresponding to each frame of static image of the video sequence to be recognized.
In the present embodiment, the multi-frame fusion scheme is: if the input static image of the current frame is the n-th frame and the probabilities of the three categories in its individual output result are Pn^1, Pn^2 and Pn^3, then taking the average means that this individual output result of the current frame is averaged with the output results of the previous n−1 frames.
Because an action is a three-dimensional space-time signal, a large error may occur if the spatial channel takes only the output of the current frame as the basis for the decision. This embodiment therefore recognizes actions in the spatial channel in a multi-frame fusion mode, averaging the recognition results of the current frame and a fixed number of previous frames. As shown in fig. 2, the outputs of the current frame and the previous 2 frames are fused; although the current frame alone is recognized incorrectly, the correct result is finally output through the correction of the previous 2 frames, so the recognition accuracy is improved.
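A small sketch of this multi-frame fusion rule, under the assumptions that the per-frame class probabilities are already available and that a 3-frame window is used as in fig. 2 (the window size and the example numbers are illustrative, not taken from the patent):

```python
import numpy as np

def fuse_frames(prob_history, window=3):
    """prob_history: list of per-frame class probability vectors, newest last.
    The probabilities inside the window are averaged before the decision is taken."""
    window_probs = np.stack(prob_history[-window:])    # current frame + previous frames
    fused = window_probs.mean(axis=0)                  # element-wise average
    return int(np.argmax(fused)), fused

# The current frame alone would be classified as category 1,
# but the two preceding frames correct the decision to category 0.
history = [np.array([0.7, 0.2, 0.1]),   # frame t-2
           np.array([0.6, 0.3, 0.1]),   # frame t-1
           np.array([0.2, 0.5, 0.3])]   # frame t (misclassified on its own)
print(fuse_frames(history))
```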
The fifth concrete implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing the global space-time characteristics, which is described in the fourth embodiment, and the specific process of the fifth step in the embodiment is as follows:
sequentially inputting the energy motion historical maps of the single-frame static images of the training set in the second step into the established global time domain channel convolutional neural network, training the global time domain channel convolutional neural network by adopting an Adam gradient descent method, setting the mini-batch size to be 32, adopting Keras default parameters as parameters, and stopping training if the action recognition accuracy of the test set is not increased for at least 10 times continuously;
the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the t-th frame of static image is Hτ(x, y, t), which is obtained according to the update function:

Hτ(x, y, t) = τ, if ψ(x, y, t) = 1
Hτ(x, y, t) = max(0, Hτ(x, y, t−1) − δ), otherwise

in the formula: (x, y) is the position of the pixel point in the energy motion history map corresponding to the t-th frame of static image; max represents taking the larger of 0 and Hτ(x, y, t−1) − δ; Hτ(x, y, t−1) is the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the (t−1)-th frame of static image; τ is the duration and δ is the decay parameter;
ψ(x, y, t) is the update function, which judges whether each pixel point belongs to the foreground of the current frame; if so, ψ(x, y, t) = 1, otherwise ψ(x, y, t) = 0;
ψ(x, y, t) is obtained by the inter-frame difference method:

ψ(x, y, t) = 1, if D(x, y, t) ≥ ξ
ψ(x, y, t) = 0, otherwise

D(x, y, t) = |I(x, y, t) − I(x, y, te)|

in the formula: I(x, y, t) is the gray value of the pixel point at coordinates (x, y) in the t-th frame of static image; I(x, y, te) is the gray value of the pixel point at coordinates (x, y) in the previous effective frame of static image; ξ is the threshold used to discriminate between foreground and background; D(x, y, t) is the absolute value of the difference between I(x, y, t) and I(x, y, te);
the process of calculating the energy motion history map comprises the following steps:
if the static image of the current frame is an effective frame, updating the energy motion historical map once, otherwise, not updating;
the judgment principle of the effective frame is as follows: setting a first frame static image as an effective frame, and if the motion energy of the current frame static image relative to the previous effective frame static image is greater than a threshold value mu, setting the current frame as the effective frame;
define Et as the motion energy of the t-th frame of static image It relative to the previous effective frame of static image Ite:

Et = (1/C) Σ(x=1..w) Σ(y=1..h) dt(x, y)

dt(x, y) = √( dt^x(x, y)² + dt^y(x, y)² )

wherein: C is the number of pixel points of the t-th frame of static image that have a displacement relative to the previous effective frame of static image; w and h are the width and height of the t-th frame of static image, respectively; dt(x, y) is the displacement of the pixel point (x, y) in the t-th frame of static image relative to the previous effective frame of static image; dt^x(x, y) is the displacement of the pixel point (x, y) between the t-th frame of static image and the previous effective frame of static image in the horizontal direction, and dt^y(x, y) is the displacement in the vertical direction;
the global dense optical flow is computed as:

[dt^x, dt^y] = CalcOpticalFlowFarneback(Ite, It)

in the formula: dt^x and dt^y are the optical flow between the t-th frame of static image and the previous effective frame of static image in the horizontal and vertical directions; CalcOpticalFlowFarneback is the optical flow function.
The EMHI is a vision-based template that represents human actions in the form of image gray values by accumulating the pixel changes at the same position over a period of time. Considering that many actions span a large number of frames, if the EMHI were updated with every frame, the earlier motion information would fade away, so an update method based on effective frames is proposed.
In essence, the displacement of the pixels is used to decide whether a frame is an effective frame, but simply summing the displacements of all pixels in the image is not feasible: because viewing angles differ, the proportion of the moving person in the image differs, and a person close to the camera can obtain a large motion energy with only a small movement. The influence of the viewing angle is therefore eliminated by dividing by the number of effective (moving) pixels.
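A hedged Python/OpenCV sketch of one EMHI update step as described above; the threshold values ξ (foreground), μ (effective frame), the duration τ, the decay δ and the small displacement cut-off used to count moving pixels are assumed for illustration and are not specified in the patent:

```python
import cv2
import numpy as np

XI, MU, TAU, DELTA = 30, 1.0, 255, 15      # assumed parameter values
MIN_DISP = 0.5                             # assumed cut-off for "pixels with displacement"

def update_emhi(emhi, prev_valid, frame_gray):
    """One EMHI update step. emhi is float32, the frames are uint8 grayscale.
    Returns the (possibly unchanged) EMHI and the new previous effective frame."""
    # Global dense optical flow between the previous effective frame and the current frame.
    flow = cv2.calcOpticalFlowFarneback(prev_valid, frame_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    d_t = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)   # displacement magnitude d_t(x, y)
    moving = d_t > MIN_DISP
    C = max(int(moving.sum()), 1)                          # number of moving pixels
    E_t = d_t.sum() / C                                    # motion energy E_t

    if E_t <= MU:                  # not an effective frame: EMHI is left unchanged
        return emhi, prev_valid

    # Frame difference D(x, y, t) and foreground mask psi(x, y, t).
    D = cv2.absdiff(frame_gray, prev_valid)
    psi = D >= XI
    # EMHI update: set foreground pixels to tau, decay the rest by delta (floor at 0).
    emhi = np.where(psi, TAU, np.maximum(emhi - DELTA, 0)).astype(np.float32)
    return emhi, frame_gray        # the current frame becomes the previous effective frame
```

In use, the first frame of the sequence is taken as the initial effective frame, as stated in the embodiment, and the EMHI can be initialised to an all-zero array (an assumption for this sketch).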
The sixth specific implementation mode: the embodiment further defines the convolutional neural network human body action recognition method fusing global space-time characteristics described in the fifth embodiment. In the sixth step of the embodiment, the parameters of the global time domain channel convolutional neural network are fine-tuned according to the fall action data set collected in the third step, that is, the output category number of the last layer of the global time domain channel convolutional neural network is modified to 3;
sequentially inputting the energy motion history maps corresponding to each frame of static image in the training set into the global time domain channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the energy motion history maps of the test set does not increase for at least 10 consecutive evaluations; the probability values P1′, P2′ and P3′ of the 3 categories of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output.
The seventh embodiment: the embodiment further defines the convolutional neural network human body action recognition method fusing global space-time characteristics, and calculates the probability average value of each category for each frame of static image of the fall action data set,

(P1 + P1′) / 2, (P2 + P2′) / 2 and (P3 + P3′) / 2,

and takes the category with the maximum probability average value as the action recognition result of each frame of static image.
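As an illustration of this fusion step, a minimal sketch (the example probability values are made up for demonstration):

```python
import numpy as np

def fuse_two_channels(p_spatial, p_temporal):
    """Average the spatial-channel and global time-domain-channel class probabilities
    and return the index of the category with the largest mean probability."""
    p_mean = (np.asarray(p_spatial, dtype=float) + np.asarray(p_temporal, dtype=float)) / 2.0
    return int(np.argmax(p_mean)), p_mean

# Example: P1, P2, P3 from the spatial channel and P1', P2', P3' from the temporal channel.
label, p_mean = fuse_two_channels([0.6, 0.3, 0.1], [0.5, 0.2, 0.3])
print(label, p_mean)    # 0 [0.55 0.25 0.2]
```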
Examples
The UCF101 database is selected to evaluate the recognition effect; it contains 13320 videos of 101 actions with complex action scenes. The trained network is then transferred to the small sample data set used in this project.
The invention designs two-channel CNNs: the basic network structures of the spatial channel convolutional neural network and the global time domain channel convolutional neural network both adopt the Inception V3 basic network structure; the input of the spatial channel convolutional neural network is a single-frame static image, and the input of the global time domain channel convolutional neural network is the energy motion history map (EMHI) of the single-frame image. The two channels are trained independently, and finally the output results of the two channels are fused to recognize the human action.
The UCF101 spatial channel is trained until a high recognition rate is reached and then transferred to the small sample data set for fine-tuning; in the test set, 30 consecutive frames of each video sequence are selected for evaluation.
The test results are shown in Table 1. On the UCF101 data set, the spatial channel recognition accuracy is 70.2%, and the multi-frame fusion mode improves it to 70.9%, 71.3% and 71.5%, respectively. The method performs better on the small sample data set: the spatial channel recognition accuracy is 73.4%, and the multi-frame fusion mode improves it to 74.7%, 74.9% and 75.1%, respectively. The small sample data set has only 3 action types, far fewer than the UCF101 data set, so the error is smaller. The multi-frame fusion method can improve recognition accuracy and reduce error, which proves its effectiveness.
TABLE 1 average recognition rate of spatial channels
The MHI and the EMHI are calculated from the video data set respectively to serve as training data sets for the global time domain channel; the global time domain channel is trained on UCF101 until a high recognition rate is reached and then transferred to the small sample data set for fine-tuning, and the recognition effects of the MHI and the EMHI are compared with the same test method as the spatial channel. Since the input of the global time domain channel is a single-channel grey-scale map while the rest of the channel (identical in structure to the spatial channel) expects a three-channel RGB map, as shown in FIG. 3 the invention adds one more convolutional layer after the input layer; the number of convolution kernels is 3 and the boundary is zero-padded, so that the structure of the original input layer is satisfied.
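A sketch of this input adaptation, assuming the Keras functional API and that the rest of the channel reuses the same structure as the spatial channel network sketched earlier (build_spatial_channel is the hypothetical helper from that sketch):

```python
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

emhi_input = Input(shape=(299, 299, 1))                    # single-channel EMHI, 299 x 299 x 1
rgb_like = Conv2D(3, (3, 3), padding="same")(emhi_input)   # 3 kernels, 3 x 3, zero padding -> 299 x 299 x 3
backbone = build_spatial_channel(num_classes=101)          # same structure as the spatial channel
temporal_net = Model(inputs=emhi_input, outputs=backbone(rgb_like))
```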
As shown in Table 2, on the UCF101 data set the action recognition accuracy using the MHI is 75.8%, and the action recognition rate of the EMHI is 78.3%. On the small sample data set the MHI action recognition accuracy is 78.4% and the EMHI action recognition rate is 80.2%. In general, the action recognition accuracy of the EMHI is higher than that of the MHI, which verifies the effectiveness of the EMHI for action recognition.
TABLE 2 Global time Domain channel average identification Rate
The recognition results of the spatial channel convolutional network and the global time domain channel convolutional network are then fused, with the same test method as above. As shown in Table 3, the average recognition rate on the UCF101 data set is 85.2%, and the average recognition rate on the small sample data set is 87.2%. It can be seen that the deep feature learning capabilities of the spatial channel and the global time domain channel complement each other.
TABLE 3 two-channel average recognition rate
The invention provides a two-channel convolutional neural network human action recognition framework based on spatial and global time domain characteristics, which extracts deep features of human action information well. The spatial channel is recognized in a multi-frame fusion mode, and the experimental results show that this method effectively improves the recognition accuracy of the spatial channel; for the global time domain channel, the motion-energy-based EMHI with adaptive capability proposed by the invention extracts global temporal action features more effectively than the traditional MHI. The two channels are combined by average fusion for recognition, and the experimental results show that the two channels complement each other, which improves the accuracy of action recognition. In addition, the proposed method is pre-trained with a large action data set, and its transfer to a small sample data set shows good recognition accuracy, which verifies the effectiveness of the method.

Claims (7)

1. A convolutional neural network human body action recognition method fused with global space-time characteristics is characterized by comprising the following specific steps:
step one, selecting Inception V3 as a basic network structure, and establishing a spatial channel convolutional neural network;
step two, migrating the parameters of the first 10 layers of the pretrained Inception V3 basic network structure model on the ImageNet data set to the spatial channel convolutional neural network established in the step one; cutting a UCF101 video data set into single-frame static images, randomly dividing the cut single-frame static images into training set data and testing set data, and training and testing a spatial channel convolution neural network;
step three, collecting a video sequence to be recognized, cutting the video sequence to be recognized into each frame of static image to be used as training set and test set data, finely adjusting parameters of the spatial channel convolutional neural network trained in the step two, training and testing the spatial channel convolutional neural network by using each frame of static image of the training set and the test set, and outputting the probability values P1, P2, …, PN of each category corresponding to each frame of static image of the video sequence to be recognized;
step four, establishing a global time domain channel convolutional neural network, wherein the global time domain channel convolutional neural network only adds one convolutional layer with a 3 × 3 convolution kernel after the input layer of the spatial channel convolutional neural network, and the rest of the network structure is the same as that of the spatial channel convolutional neural network;
fifthly, training the global time domain channel convolution neural network established in the fourth step by utilizing the energy motion historical map corresponding to each frame of static image in the training set in the second step; testing a global time domain channel convolution neural network by utilizing an energy motion historical map corresponding to each frame of static image in the test set in the step two;
step six, after the parameters of the global time domain channel convolutional neural network trained in the step five are finely adjusted, the global time domain channel convolutional neural network is trained and tested by using the energy motion history maps corresponding to each frame of static image of the training set and the test set in the step three, and the probability values P1′, P2′, …, PN′ of each category of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized are output;
step seven, respectively fusing, for each frame of static image in the video sequence to be recognized, the output of the spatial channel convolutional neural network and the output of the global time domain channel convolutional neural network, namely calculating the probability average value of each category for each frame of static image,

(Pi + Pi′) / 2, i = 1, 2, …, N,

and taking the category with the maximum probability average value as the action recognition result of each frame of static image.
2. The method for recognizing the human body action of the convolutional neural network fused with the global spatiotemporal features as claimed in claim 1, wherein the specific process of the step one is as follows:
selecting Inception V3 as the basic network structure, removing the last fully connected layer of the basic network structure, and sequentially adding, from front to back, a fully connected layer with 1024 neurons, a fully connected layer with 256 neurons, and a fully connected layer whose number of neurons equals the number N of action categories.
3. The method for recognizing the human body action of the convolutional neural network fused with the global spatiotemporal features as claimed in claim 2, wherein the specific process of the second step is as follows:
migrating the parameters of the first 10 layers of the Inception V3 basic network structure model pre-trained on the ImageNet data set, namely the parameters from the 1st convolutional layer of the model up to the 3rd Inception module, to the spatial channel convolutional neural network established in the first step; cutting the UCF101 video data set into standard-input single-frame static images with a size of 299 × 299; randomly dividing the cut single-frame static images into training set and test set data; sequentially inputting the static images of the training set into the spatial channel convolutional neural network and training with the Adam gradient descent method, with the mini-batch size set to 32 and the Keras default parameters used for the remaining settings; and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations.
4. The method for recognizing human body actions through the convolutional neural network fused with global spatio-temporal features as claimed in claim 3, wherein in the third step, a data set of falling actions is collected as a video sequence to be recognized, the video sequence to be recognized comprises actions of falling, walking and sitting, each action comprises M video sequences, the M video sequences are randomly divided into a training set and a testing set, and each video sequence is cut into K frames of static images;
finely adjusting parameters of the spatial channel convolutional neural network, namely modifying the output class of the last layer of the spatial channel convolutional neural network to be 3;
sequentially inputting the static images of the training set into the spatial channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the static images of the test set does not increase for at least 10 consecutive evaluations;
performing action recognition in the spatial channel convolutional neural network in a multi-frame fusion mode, i.e. averaging the output for the input current frame of static image with the outputs for the previous frames of static images, and outputting the probability values P1, P2 and P3 of the 3 categories corresponding to each frame of static image of the video sequence to be recognized.
5. The method for identifying human body actions by using convolutional neural network fused with global spatiotemporal features as claimed in claim 4, wherein the specific process of the fifth step is as follows:
sequentially inputting the energy motion historical maps of the single-frame static images of the training set in the second step into the established global time domain channel convolutional neural network, training the global time domain channel convolutional neural network by adopting an Adam gradient descent method, setting the mini-batch size to be 32, adopting Keras default parameters as parameters, and stopping training if the action recognition accuracy of the test set is not increased for at least 10 times continuously;
the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the t-th frame of static image is Hτ(x, y, t), which is obtained according to the update function:

Hτ(x, y, t) = τ, if ψ(x, y, t) = 1
Hτ(x, y, t) = max(0, Hτ(x, y, t−1) − δ), otherwise

in the formula: (x, y) is the position of the pixel point in the energy motion history map corresponding to the t-th frame of static image; max represents taking the larger of 0 and Hτ(x, y, t−1) − δ; Hτ(x, y, t−1) is the gray value of the pixel point with coordinates (x, y) in the energy motion history map corresponding to the (t−1)-th frame of static image; τ is the duration and δ is the decay parameter;
ψ(x, y, t) is the update function, which judges whether each pixel point belongs to the foreground of the current frame; if so, ψ(x, y, t) = 1, otherwise ψ(x, y, t) = 0;
ψ(x, y, t) is obtained by the inter-frame difference method:

ψ(x, y, t) = 1, if D(x, y, t) ≥ ξ
ψ(x, y, t) = 0, otherwise

D(x, y, t) = |I(x, y, t) − I(x, y, te)|

in the formula: I(x, y, t) is the gray value of the pixel point at coordinates (x, y) in the t-th frame of static image; I(x, y, te) is the gray value of the pixel point at coordinates (x, y) in the previous effective frame of static image; ξ is the threshold used to discriminate between foreground and background; D(x, y, t) is the absolute value of the difference between I(x, y, t) and I(x, y, te);
the process of calculating the energy motion history map comprises the following steps:
if the static image of the current frame is an effective frame, updating the energy motion historical map once, otherwise, not updating;
the judgment principle of the effective frame is as follows: setting a first frame static image as an effective frame, and if the motion energy of the current frame static image relative to the previous effective frame static image is greater than a threshold value mu, setting the current frame as the effective frame;
define Et as the motion energy of the t-th frame of static image It relative to the previous effective frame of static image Ite:

Et = (1/C) Σ(x=1..w) Σ(y=1..h) dt(x, y)

dt(x, y) = √( dt^x(x, y)² + dt^y(x, y)² )

wherein: C is the number of pixel points of the t-th frame of static image that have a displacement relative to the previous effective frame of static image; w and h are the width and height of the t-th frame of static image, respectively; dt(x, y) is the displacement of the pixel point (x, y) in the t-th frame of static image relative to the previous effective frame of static image; dt^x(x, y) is the displacement of the pixel point (x, y) between the t-th frame of static image and the previous effective frame of static image in the horizontal direction, and dt^y(x, y) is the displacement in the vertical direction;
the global dense optical flow is computed as:

[dt^x, dt^y] = CalcOpticalFlowFarneback(Ite, It)

in the formula: dt^x and dt^y are the optical flow between the t-th frame of static image and the previous effective frame of static image in the horizontal and vertical directions; CalcOpticalFlowFarneback is the optical flow function.
6. The method for recognizing the human body actions of the convolutional neural network fused with the global space-time characteristics as claimed in claim 5, wherein in the sixth step, the parameters of the global time domain channel convolutional neural network are finely adjusted according to the fall action data sets collected in the third step, that is, the output category of the last layer of the global time domain channel convolutional neural network is modified to be 3;
sequentially inputting the energy motion history maps corresponding to each frame of static image in the training set into the global time domain channel convolutional neural network after parameter fine-tuning, training the last fully connected layer with the Adam gradient descent method, then, after at least 10 epochs, training the last two fully connected layers with the stochastic gradient descent method with the learning rate set to 0.0001 and the momentum set to 0.9, and stopping training if the recognition accuracy on the energy motion history maps of the test set does not increase for at least 10 consecutive evaluations; and outputting the probability values P1′, P2′ and P3′ of the 3 categories of the energy motion history map corresponding to each frame of static image of the video sequence to be recognized.
7. The method of claim 6, wherein the probability average value of each category for each frame of static image of the fall action data set is calculated,

(P1 + P1′) / 2, (P2 + P2′) / 2 and (P3 + P3′) / 2,

and the category with the maximum probability average value is taken as the action recognition result of each frame of static image.
CN201810671262.9A 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics Active CN108830252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810671262.9A CN108830252B (en) 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810671262.9A CN108830252B (en) 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics

Publications (2)

Publication Number Publication Date
CN108830252A CN108830252A (en) 2018-11-16
CN108830252B true CN108830252B (en) 2021-09-10

Family

ID=64137766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810671262.9A Active CN108830252B (en) 2018-06-26 2018-06-26 Convolutional neural network human body action recognition method fusing global space-time characteristics

Country Status (1)

Country Link
CN (1) CN108830252B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508684B (en) * 2018-11-21 2022-12-27 中山大学 Method for recognizing human behavior in video
CN111261190A (en) * 2018-12-03 2020-06-09 北京嘀嘀无限科技发展有限公司 Method, system, computer device and storage medium for recognizing sound
CN109522874B (en) * 2018-12-11 2020-08-21 中国科学院深圳先进技术研究院 Human body action recognition method and device, terminal equipment and storage medium
CN109726672B (en) * 2018-12-27 2020-08-04 哈尔滨工业大学 Tumbling detection method based on human body skeleton sequence and convolutional neural network
CN109685037B (en) * 2019-01-08 2021-03-05 北京汉王智远科技有限公司 Real-time action recognition method and device and electronic equipment
CN110068302A (en) * 2019-03-07 2019-07-30 中科院微电子研究所昆山分所 A kind of vehicle odometry method based on deep neural network
CN109886358B (en) * 2019-03-21 2022-03-08 上海理工大学 Human behavior recognition method based on multi-time-space information fusion convolutional neural network
CN111832351A (en) * 2019-04-18 2020-10-27 杭州海康威视数字技术股份有限公司 Event detection method and device and computer equipment
CN110110624B (en) * 2019-04-24 2023-04-07 江南大学 Human body behavior recognition method based on DenseNet and frame difference method characteristic input
CN110110812B (en) * 2019-05-20 2022-08-19 江西理工大学 Stream depth network model construction method for video motion recognition
CN110334589B (en) * 2019-05-23 2021-05-14 中国地质大学(武汉) High-time-sequence 3D neural network action identification method based on hole convolution
CN110188653A (en) * 2019-05-27 2019-08-30 东南大学 Activity recognition method based on local feature polymerization coding and shot and long term memory network
CN110334607B (en) * 2019-06-12 2022-03-04 武汉大学 Video human interaction behavior identification method and system
CN110414367B (en) * 2019-07-04 2022-03-29 华中科技大学 Time sequence behavior detection method based on GAN and SSN
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110532431B (en) * 2019-07-23 2023-04-18 平安科技(深圳)有限公司 Short video keyword extraction method and device and storage medium
CN110610145B (en) * 2019-08-28 2022-11-08 电子科技大学 Behavior identification method combined with global motion parameters
CN110580681B (en) * 2019-09-12 2020-11-24 杭州海睿博研科技有限公司 High-resolution cardiac motion pattern analysis device and method
CN110705412A (en) * 2019-09-24 2020-01-17 北京工商大学 Video target detection method based on motion history image
CN111046821B (en) * 2019-12-19 2023-06-20 东北师范大学人文学院 Video behavior recognition method and system and electronic equipment
CN111353394B (en) * 2020-02-20 2023-05-23 中山大学 Video behavior recognition method based on three-dimensional alternate update network
CN111401507B (en) * 2020-03-12 2021-01-26 大同公元三九八智慧养老服务有限公司 Adaptive decision tree fall detection method and system
CN113468913B (en) * 2020-03-30 2022-07-05 阿里巴巴集团控股有限公司 Data processing method, motion recognition method, model training method, device and storage medium
CN111507252A (en) * 2020-04-16 2020-08-07 上海眼控科技股份有限公司 Human body falling detection device and method, electronic terminal and storage medium
CN111582231A (en) * 2020-05-21 2020-08-25 河海大学常州校区 Fall detection alarm system and method based on video monitoring
CN111866449B (en) * 2020-06-17 2022-03-29 中国人民解放军国防科技大学 Intelligent video acquisition system and method
CN112115788A (en) * 2020-08-14 2020-12-22 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112115846B (en) * 2020-09-15 2024-03-01 上海迥灵信息技术有限公司 Method and device for identifying random garbage behavior and readable storage medium
CN112381118B (en) * 2020-10-23 2024-05-17 百色学院 College dance examination evaluation method and device
CN113158723B (en) * 2020-12-25 2022-06-07 神思电子技术股份有限公司 End-to-end video motion detection positioning system
CN112766062B (en) * 2020-12-30 2022-08-05 河海大学 Human behavior identification method based on double-current deep neural network
CN112766176B (en) * 2021-01-21 2023-12-01 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN112818914B (en) * 2021-02-24 2023-08-18 网易(杭州)网络有限公司 Video content classification method and device
CN113343786B (en) * 2021-05-20 2022-05-17 武汉大学 Lightweight video action recognition method and system based on deep learning
CN114360209B (en) * 2022-01-17 2023-06-23 常州信息职业技术学院 Video behavior recognition security system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732208A (en) * 2015-03-16 2015-06-24 电子科技大学 Video human action reorganization method based on sparse subspace clustering
CN106778474A (en) * 2016-11-14 2017-05-31 深圳奥比中光科技有限公司 3D human body recognition methods and equipment
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639042B2 (en) * 2010-06-22 2014-01-28 Microsoft Corporation Hierarchical filtered motion field for action recognition
US20150339871A1 (en) * 2014-05-15 2015-11-26 Altitude Co. Entity management and recognition system and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732208A (en) * 2015-03-16 2015-06-24 电子科技大学 Video human action reorganization method based on sparse subspace clustering
CN106778474A (en) * 2016-11-14 2017-05-31 深圳奥比中光科技有限公司 3D human body recognition methods and equipment
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Two-stream convolutional networks for action recognition in videos; Simonyan K et al.; arXiv; 2018-06-01; full text *

Also Published As

Publication number Publication date
CN108830252A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN109740419B (en) Attention-LSTM network-based video behavior identification method
CN109670446B (en) Abnormal behavior detection method based on linear dynamic system and deep network
WO2020173226A1 (en) Spatial-temporal behavior detection method
WO2021098261A1 (en) Target detection method and apparatus
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN109190479A (en) A kind of video sequence expression recognition method based on interacting depth study
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
CN105069434B (en) A kind of human action Activity recognition method in video
Chaudhari et al. Face detection using viola jones algorithm and neural networks
CN109685037B (en) Real-time action recognition method and device and electronic equipment
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
CN108960047B (en) Face duplication removing method in video monitoring based on depth secondary tree
WO2011028380A2 (en) Foreground object detection in a video surveillance system
WO2011028379A2 (en) Foreground object tracking
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN113963445A (en) Pedestrian falling action recognition method and device based on attitude estimation
CN112906631B (en) Dangerous driving behavior detection method and detection system based on video
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
Mehta et al. Motion and region aware adversarial learning for fall detection with thermal imaging
CN112926522B (en) Behavior recognition method based on skeleton gesture and space-time diagram convolution network
CN112070010B (en) Pedestrian re-recognition method for enhancing local feature learning by combining multiple-loss dynamic training strategies
JP2017162409A (en) Recognizing device, and method, for facial expressions and motions
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant