CN111860278B

CN111860278B - Human behavior recognition algorithm based on deep learning

Info

Publication number: CN111860278B
Application number: CN202010676134.0A
Authority: CN
Inventors: 张鹏超; 罗朝阳; 徐鹏飞; 刘亚恒
Original assignee: Shaanxi University of Technology
Current assignee: Shaanxi University of Technology
Priority date: 2020-07-14
Filing date: 2020-07-14
Publication date: 2024-05-14
Anticipated expiration: 2040-07-14
Also published as: CN111860278A

Abstract

The invention provides a human behavior recognition algorithm based on deep learning, which comprises (1) preprocessing an input video segment; (2) constructing a network model RD3D; (3) Defining a loss function, accuracy and operation of an optimizer; (4) training the network model comprising the sub-steps of: (41) initializing parameters; (42) a learning rate of 0.0001, a batch size of 16; (43) Calculating a loss by the forward propagation value of the RD3D model and the real tag, and updating the weight parameter by backward propagation; (44) ending the training after 100 epochs have been trained; (5) test results. The invention pursues the accuracy of the recognition algorithm from the characteristic angle, solves the problem that the existing algorithm is seriously dependent on the data set, reduces the sensitivity to the type of the data set, and can be applied to any behavior recognition data set.

Description

Human behavior recognition algorithm based on deep learning

Technical Field

The invention relates to the technical field of computer vision, in particular to a human behavior recognition algorithm based on deep learning.

Background

In recent years, with the rise of related technologies such as deep learning, deep neural networks have made breakthrough progress in various fields such as computer vision. Deep learning is capable of learning its commonality features from training data due to its end-to-end training characteristics and fitting out a network suitable for the current task. Meanwhile, mass data acquisition in the modern society becomes very easy, and convenience is provided for deep learning application to the fields of video understanding, recognition and the like.

Whereas in the conventional method, the local features (such as HOG, HOF and MBH) are mainly extracted, a great a priori knowledge is required. Although apparent and motion information is considered, such information is limited to a single frame, where the contextual appearance of the frame and motion information is ignored, which results in less accurate identification of human behavior. How to design an algorithm for behavior recognition is particularly important.

Therefore, it has been a trend to apply deep learning and human behavior recognition. The behavior recognition method based on deep learning mainly comprises the following steps: the double-flow convolutional neural network, the 3D convolutional neural network, the convolutional neural network and the cyclic neural combination. The invention is based on a 3D convolution network, and improves the recognition precision.

Patent CN 110163133A, "a human body behavior recognition method based on a depth residual error network", discloses a human body behavior recognition method based on a depth residual error network, wherein human body joint data and depth image data are simultaneously input into ResNet for recognition, but the recognition accuracy is improved, but the human body joint data and the depth image are required to be input, so that the end-to-end learning cannot be realized, and the data are lacking in daily life. Patent CN 107862275A, human body behavior recognition model, construction method and human body behavior recognition method, discloses a method for extracting human body behavior feature vectors by adopting a 3D convolutional neural network, inputting the extracted feature vectors into a coulomb force field, and clustering all feature vectors by relative movement under the action of attraction force generated by the same class and repulsion force generated by different classes to finish human body behavior recognition. The RGB diagram and the optical flow diagram are input into the network for learning, and the end-to-end learning is not performed, the whole network has seven layers, wherein only three layers perform feature extraction, and the accuracy is low although the calculated amount is small;

The recognition accuracy is improved from the point of conforming to the data set, the accuracy of behavior recognition can not be improved by means of RGB images, and patent CN 109002808A 'a human behavior recognition method and system' discloses a human behavior recognition method and system, a multi-task deep learning method is used for training a 3D convolutional neural network, continuous video frames of various human behavior attributes and background videos are used as input, and recognition tasks are completed after training. More, how to make a data set in the application of multitask learning is taught, so that a behavior video and a background video are distinguished, feature extraction is completed only by means of a seven-layer common 3D convolution network, and classification is achieved. Identification of human behavior is still accomplished from the perspective of the data set.

Disclosure of Invention

Aiming at the technical problems, the invention provides a human behavior recognition algorithm based on deep learning, which comprises the following steps:

(1) Preprocessing an input video segment;

(2) Constructing a network model RD3D;

(3) Defining a loss function and an operation of an optimizer;

(4) Training the network model comprises the sub-steps of:

(41) Initializing parameters;

(42) Learning rate is 0.0001, and batch size is 16;

(43) Calculating a loss according to a loss function by the forward propagation value of the RD3D model and the real tag, and updating weight parameters by backward propagation of the loss;

(44) Finishing training after training 100 epochs;

(5) And (5) testing results.

Furthermore, in the preprocessing stage of the step (1), in order to comprehensively consider the global motion information of the video, a secondary sampling algorithm is provided and adopted to collect n frames of key video frames, so that the recognition accuracy is improved, and the specific contents are as follows:

a: each video segment is subjected to image frame acquisition (alpha=3) according to an acquisition rate alpha, and an image data set A corresponding to each video is obtained;

b: uniformly acquiring n frames (n=16) from the image data set A by adopting a subsampling algorithm, and scaling the n frames to k×k (k=224) to form a data set B as key frames of video clips;

c: the acquired data set B was processed as 7: the 3 proportion is divided into a training set and a testing set for training and testing, wherein each sample in the training set is a four-element (candidate, positive, negative, label) sample to be predicted, other samples in the same category with the sample to be predicted, other samples in different categories with the sample to be predicted, and category labels of the sample to be predicted.

Furthermore, in step (2), in order to improve the recognition accuracy, a novel network model RD3D (Residual Dense 3D) is proposed and designed by combining the feature multiplexing idea and the shortcut idea. The RD3D model designs 134 layers, i.e., 1+4×4+6×6×3+2×4+1,6 stages.

Further, step (3) proposes and designs a new type of loss function:

F＝H(P,Q)+L_re+L_tr

Wherein:

Cross entropy H (P, Q) = -P (x) log (Q (x)) measures the similarity of the predicted and true distributions, the smaller the loss, the more accurate the classification. Wherein P is the real sample distribution, and Q is the predicted sample distribution;

l2 regularization loss To prevent overfitting, where λ is the penalty factor (λ=0.009), n is the number of weights W;

ternary loss Where ||f (x _i)-f(x^p _i)||₂ ² is the Euclidean distance of x _i and x ^p _i, ||f (x _i)-f(xⁿ _i)||₂ ² is the Euclidean distance of x _i and x ⁿ _i, |f (x) is the characteristics of sample x extracted by RD3D, bs is batchsize, x _i is the sample currently predicted, x ^p _i is the sample of the same class as currently calculated sample x _i, x ⁿ _i is the sample not of the same class as currently calculated sample x _i, and β is the distance threshold of x _i and x ^p _i、x_i and x ⁿ _i (β=0.2).

The invention overcomes the problem that the existing algorithm is seriously dependent on the data set while pursuing the accuracy of the identification algorithm, realizes the network structure design from the aspect of the characteristics extracted from the human behavior, is insensitive to the type of the data set, and can be applied to any data set.

Drawings

FIG. 1 is a RD3D model of the present invention;

FIG. 2 is a ConvBlock of the present invention;

FIG. 3 is a IDBlock of the present invention;

Fig. 4 is a flow chart of the present invention.

Detailed Description

The specific technical scheme of the invention is described by combining the embodiments.

As shown in fig. 4, a human behavior recognition algorithm based on deep learning includes the following steps:

(1) Preprocessing an input video segment (UCF 101 dataset is taken as an example in this embodiment);

(2) Constructing a network model RD3D;

(3) Defining a loss function, accuracy and operation of an optimizer;

(4) Training the network model comprises the sub-steps of:

(41) Initializing parameters;

(42) Learning rate is 0.0001, and batch size is 16;

(44) Finishing training after training 100 epochs;

(5) And (5) testing results.

Specific:

(1) In the preprocessing stage, in order to comprehensively consider the video global motion information, a secondary sampling algorithm is proposed and adopted to collect n frames of key frames, so that the recognition accuracy is improved, and the specific contents are as follows:

(2) In order to improve the recognition accuracy, a novel network model RD3D (Residual Dense 3D) is provided and designed by combining the characteristic multiplexing idea and the shortcut idea, the structure of the novel network model RD3D is shown in figure 1, 127 layers, 6 stages are designed by the RD3D model, and the content is as follows:

a: stage1 consists of Conv3d, BN, relu, maxPool, where Conv3d has 64 filters, convolution kernel 3 x 3, stride= [1, 2], and padding is SAME; the pooling window in MaxPool is 1×3×3, stride= [1, 2]. Stage1 has an input dimension of [16, 16, 224,3] and an output dimension of [16, 16, 56, 56, 64];

b: stage2 consists of Conv Block4, three ID Block4, maxpool, where Conv Block4 is connected together by the addition of channels by a 4-layer 3D convolution group and a shortcut with two-layer convolution, as shown in fig. 2, wherein the number of filters of the 4-layer 3D convolution sets is 64, 64, 128, 128, the convolution kernels are all 3 x 3; in the convolution set, the input of the next layer is the output of all the previous layers in the block. The number of filters in shortcut is 128, the convolution kernels are 1×1,3×3, respectively. ID Block4 is a combination of a 4-layer 3D convolution and the input addition of the Block, as shown in figure 3, wherein the number of filters of the 4-layer 3D convolution sets is 64, 64, 128, 128, the convolution kernels are all 3 x 3; in the convolution set, the input of the next layer is the output of all the previous layers in the block. Pooling window 2 x 2, stride= [2, 2] in Maxpool. Stage2 has an input dimension of [16,16,56,56,64] and an output dimension of [16,8,28,28,128];

c: the composition of stage3 and stage4 is identical to stage2, the only difference is that the number of layers and the number of filters in each block are different, in stage3, the number of layers of ConvBlock and IDBlock is 6, the number of filters in each layer is 128, 128, 256, 256, 512, the number of filters of shortcut in ConvBlock is 512, the input dimension of stage3 is [16,8,28,28,128], and the output dimension is [16,4,14,14,512]; in stage4, the number of layers ConvBlock and IDBlock is 6, the number of filters in each layer is 256, 256, 512, 512, 1024, and the number of filters in shortcut in convblock is 1024.Stage4 has an input dimension of [16,4,14,14,512] and an output dimension of [16,2,7,7,1024];

d: stage5 and stage2 differ in composition in that stage5 does not have MaxPool. In ConvBlock and IDBlock, the number of layers is 6, the number of filters of each layer is 512, 512, 1024, 1024, 2048, the number of filters of shortcut in ConvBlock is 2048, the input dimension of stage5 is [16,2,7,7,1024], and the output dimension is [16,2,7,7,2048];

e: stage6 is composed of AvgPool, flatten, FC, softmax, as shown in fig. 1, in which AvgPool is global average pooling, the pooling window is 2×7×7, and flat is that output reshape of the upper layer is [16,2048], FC is the full connection layer, output dimension is ucf of category number 101, softmax is the classification layer. Stage6 has an input dimension of [16,2,7,7,2048] and an output dimension of [16,101]

(3) The loss function is designed. In order to expand the distinction of different types of samples and improve the recognition precision, the invention adds ternary loss on the basis of the traditional loss function to obtain a new loss function:

F=h (P, Q) +l _re+L_tr wherein,

A: cross entropy H (P, Q) = -P (x) log (Q (x)) measures the similarity of the predicted and true distributions, the smaller the loss, the more accurate the classification. Wherein P is the real sample distribution, and Q is the predicted sample distribution;

b: l2 regularization loss To prevent overfitting, where λ is the penalty factor (λ=0.009), n is the number of weights W;

c: ternary loss:

Where ||f (x _i)-f(x^p _i)||₂ ² is the Euclidean distance of x _i and x ^p _i, ||f (x _i)-f(xⁿ _i)||2² is the Euclidean distance of x _i and x ⁿ _i, |f (x) is the characteristics of sample x extracted by RD3D, bs is batchsize, x _i is the sample currently predicted, x ^p _i is the sample of the same class as currently calculated sample x _i, x ⁿ _i is the sample not of the same class as currently calculated sample x _i, and β is the distance threshold of x _i and x ^p _i、x_i and x ⁿ _i (β=0.2).

Claims

1. The human behavior recognition algorithm based on deep learning is characterized by comprising the following steps of:

(1) Preprocessing an input video segment;

in the preprocessing stage of the step (1), a secondary sampling algorithm is provided and adopted to collect n frames of key frames, and the method specifically comprises the following steps:

a: acquiring image frames of each video clip according to the acquisition rate alpha to obtain an image data set A corresponding to each video;

b: uniformly acquiring n frames from the image data set A by adopting a subsampling algorithm, taking the n frames as key frames of video clips, and scaling the key frames to k x k to form a data set B;

c: the acquired data set B was processed as 7: the 3 proportion is divided into a training set and a testing set for training and testing, wherein each sample in the training set is a quadruple, and is a sample to be predicted, other samples of the same class of the sample to be predicted, other samples of different classes of the sample to be predicted, and class labels of the sample to be predicted;

(2) Constructing a network model RD3D;

(3) Defining a loss function and an operation of an optimizer;

(4) Training a network model; comprises the following substeps:

(41) Initializing parameters;

(42) Learning rate is 0.0001, and batch size is 16;

(44) Finishing training after training 100 epochs;

(5) And (5) testing results.

2. The human behavior recognition algorithm based on deep learning of claim 1, wherein the RD3D model of step (2) designs 134 layers, i.e., 1+4×4+6×6×3+2×4+1,6 stages.

3. The human behavior recognition algorithm based on deep learning of claim 1, wherein step (3) designs a loss function:

F＝H(P,Q)+L_re+L_tr

wherein, cross entropy H (P, Q) = -P (x) log (Q (x)), the similarity of the predicted distribution and the real distribution is measured, the smaller the loss is, the more accurate the classification is; wherein P is the real sample distribution, and Q is the predicted sample distribution;

l2 regularization loss To prevent overfitting, where λ is a penalty factor and n is the number of weights W;

Ternary loss:

Where ||f (x _i)-f(x^p _i)||₂ ² is the Euclidean distance of x _i and x ^p _i, ||f (x _i)-f(xⁿ _i)||₂ ² is the Euclidean distance of x _i and x ⁿ _i, |f (x) is the RD3D extracted feature of sample x, bs is batchsize, x _i is the current predicted sample, x ^p _i is the same class of sample as the current calculated sample x _i, x ⁿ _i is a sample not the same class as the current calculated sample x _i, and β is the distance threshold of x _i and x ^p _i、x_i and x ⁿ _i.