CN111046766A - Behavior recognition method and device and computer storage medium - Google Patents

Behavior recognition method and device and computer storage medium

Info

Publication number
CN111046766A
Authority
CN
China
Prior art keywords
frame
data
accumulated
motion vector
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911215173.4A
Other languages
Chinese (zh)
Inventor
陈璐
陆辉
史海涛
丁静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Fiberhome Digtal Technology Co Ltd
Original Assignee
Wuhan Fiberhome Digtal Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Fiberhome Digtal Technology Co Ltd filed Critical Wuhan Fiberhome Digtal Technology Co Ltd
Priority to CN201911215173.4A priority Critical patent/CN111046766A/en
Publication of CN111046766A publication Critical patent/CN111046766A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a behavior recognition method, applied in the technical field of behavior recognition, comprising the following steps: acquiring the corresponding motion vectors, residuals, and RGB frame data from a compressed video code stream file; obtaining accumulated residual data from the residuals; obtaining accumulated motion vectors from the motion vectors; using the RGB frame data, the accumulated motion vectors, and the accumulated residual data as the input of a deep learning model to obtain behavior feature vectors; inputting the behavior feature vectors into a classification model to obtain a classification result; and obtaining a behavior prediction classification result. A behavior recognition apparatus and a computer storage medium are also provided. By applying the embodiments of the invention, the time consumed by video decoding is avoided, the time bottleneck of the decoding link is eliminated, and the analysis efficiency of video files is effectively improved.

Description

Behavior recognition method and device and computer storage medium
Technical Field
The present invention relates to the field of behavior recognition processing technologies, and in particular, to a behavior recognition method, a behavior recognition device, and a computer storage medium.
Background
With the vigorous development of urban video surveillance projects, analysis of the video recording files generated by video surveillance systems is often a means of public security management.
In a conventional video analysis method, the video compression code stream is completely decoded and analyzed in the pixel domain: for example, streams in the common H.264 and H.265 formats are decoded to obtain the key frames and non-key frames of the video frame sequence, the key frames and non-key frames are analyzed to obtain accumulated motion vectors and accumulated residual data, and behavior recognition is then performed by a deep-learning-based human behavior recognition algorithm to obtain a recognition result.
Therefore, in the prior art, the compressed video code stream file needs to be decoded, which makes video analysis time-consuming and the analysis efficiency of video files low.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and provides a behavior recognition method, a behavior recognition device, and a computer storage medium. The RGB frame data corresponding to the key frames, and the residual data and motion vectors of the non-key frames, are obtained directly from the compressed video code stream file, and learning and classification are then carried out with an existing deep learning model. The compressed video code stream file therefore does not need to be fully decoded, which avoids the time consumed by video decoding, eliminates the time bottleneck of the decoding link, and effectively improves the analysis efficiency of video files.
The invention is realized by the following steps:
the invention provides a behavior recognition method, which comprises the following steps:
acquiring key frame data and non-key frame data from a compressed video code stream file, wherein the non-key frame data comprises a motion vector and a residual, and the key frame data is RGB frame data;
obtaining accumulated residual data according to the residual;
obtaining an accumulated motion vector according to the motion vector;
taking the RGB frame data, the accumulated motion vector and the accumulated residual data as the input of a deep learning model, and obtaining a behavior characteristic vector of the deep learning model;
inputting the behavior feature vector output by the deep learning model into an SVM classifier for behavior prediction;
and obtaining a behavior prediction classification result.
Further, the step of obtaining key frame data and non-key frame data in the compressed video code stream file includes:
and decoding the compressed video code stream file by adopting a media file conversion tool to obtain key frame data, and extracting the motion vector and the residual error of the non-key frame.
Further, the obtaining of the accumulated motion vector according to the motion vector specifically comprises:

φ_i^(t,k) = i − η_i^(t,k)

or, alternatively,

φ_i^(t,k) = φ_i^(t,k+1) + τ^(k+1)_{η_i^(t,k+1)}

wherein:

η_i^(t,t) = i,  η_i^(t,k) = η_i^(t,k+1) − τ^(k+1)_{η_i^(t,k+1)}

wherein τ_j^(p) is the motion vector in the p-th frame of the pixel block at position j, p ≤ t; φ_i^(t,k) is the accumulated motion vector of the pixel block at position i of the t-th frame from the k-th frame to the t-th frame; and η_i^(t,k) is the reference position of the pixel block at position i of the t-th frame traced back from the t-th frame to the k-th frame.
Further, the specific expression adopted for obtaining the accumulated residual data is as follows:

R_i^(t) = Δ_i^(t) + R^(t−1)_{η_i^(t,t−1)}

wherein R_i^(t) is the accumulated residual of the i-th pixel block in the t-th frame, Δ_i^(t) is the residual of the i-th pixel block in the t-th frame, η_i^(t,t−1) is the backtracking position of the pixel block in frame t−1, and R^(t−1)_{η_i^(t,t−1)} is the corresponding accumulated residual.
Further, the step of using the RGB frame data, the accumulated motion vector, and the accumulated residual data as input of a deep learning model and obtaining a behavior feature vector of the deep learning model includes:
acquiring an accumulated motion vector corresponding to each non-key frame and residual error data corresponding to each non-key frame;
forming an input sequence by the RGB frame data, the accumulated motion vector corresponding to each non-key frame and the residual error data corresponding to each non-key frame;
and taking the input sequence as the input of a deep learning model, and obtaining a behavior feature vector of the deep learning model.
Further, the step of classifying the feature vectors according to the classification model to obtain a classification result includes:
and classifying the feature vectors according to a Support Vector Machine (SVM) to obtain a classification result.
Further, the training process of the deep learning model comprises the following steps:
obtaining a test data set corresponding to multiple types of behaviors, wherein the test data set comprises: RGB frame data, accumulated motion vectors and accumulated residual data;
constructing an input layer: determining the number of input-layer neurons according to the test data set, the input layer receiving the test data set;
constructing a convolutional layer: determining the size and stride of the convolution kernel, where the kernel size is chosen according to the scale of the input data and the type of the data;
constructing a down-sampling layer: determining the pooling size, stride and pooling type;
constructing a fully connected layer;
the layers are connected in the order: input layer, convolutional layer, down-sampling layer, convolutional layer, fully connected layer;
and when the model precision is not less than the preset value, determining the current neural network as an available model.
In addition, the invention also discloses a behavior recognition device, which comprises a processor and a memory connected with the processor through a communication bus; wherein:
the memory is used for storing a behavior recognition program;
the processor is configured to execute the behavior recognition program to implement any of the behavior recognition steps.
Also, a computer storage medium is disclosed that stores one or more programs that are executable by one or more processors to cause the one or more processors to perform any of the behavior recognition steps.
The behavior recognition method, the behavior recognition device and the computer storage medium of the invention have the following advantages:
(1) obtaining RGB frame data, accumulated residual data of each frame and accumulated motion vectors by directly obtaining key frame data and non-key frame data from a compressed video code stream file; then, the RGB frame data, the accumulated motion vector and the accumulated residual data are used as the input of a deep learning model, and a behavior characteristic vector of the deep learning model is obtained; and inputting the behavior feature vector into a classification model to obtain a classification result, namely a behavior recognition result. According to the method, only RGB frame data corresponding to the key frames, residual error data of non-key frames and motion vectors need to be obtained through the compressed video code stream file, then learning and classification are carried out according to the existing deep learning model, the compressed video code stream file does not need to be decoded, time consumption caused by video decoding is avoided, time bottleneck caused by a decoding link is eliminated, and analysis efficiency of the video file is effectively improved.
(2) The human behaviors in the video are recognized in combination with a convolutional neural network, achieving fast, efficient and accurate behavior recognition.
(3) The dependency of non-key frames on the decoding order is removed by the decoupling model, so all frame data can be processed in parallel on hardware such as GPUs and multi-core processors, shortening the processing time of non-key frame data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a behavior recognition method according to an embodiment of the present invention;
fig. 2 is a schematic view of an application scenario of the behavior recognition method according to the embodiment of the present invention;
fig. 3 is a schematic view of an application scenario of the behavior recognition apparatus according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a behavior identification method, including the following steps:
s101, obtaining key frame data and non-key frame data in a compressed video code stream file, wherein the non-key frame data comprises: the motion vector and the residual error, and the key frame data are RGB frame data.
It should be noted that an I frame is a key frame: the frame is retained in full and contains a complete picture, so it can be decoded from its own data alone. A P frame encodes the difference between the current frame and a preceding key frame (or P frame); at decoding time this difference is superimposed on the previously buffered picture to produce the final picture. A P frame is therefore a difference frame: it carries no complete picture data, only the changes relative to the previous frame's picture.
It will be appreciated that a video sequence S is expressed as S = {I, P, …, P, I, P, …, P, …, I}, where I is a key frame and P is a non-key frame. In the compressed-domain video code stream, the frame data of the code stream is acquired through the open-source tool ffmpeg, so the set of key frame data extracted from the video sequence is expressed as S_I = {I, I, …, I}, and the set of non-key frame data is expressed as S_P = {P, P, …, P}.
In the embodiment of the invention, the compressed video code stream file is not fully decoded; instead, the open-source tool ffmpeg is used directly to extract the relevant motion vectors from the compressed video code stream file, which form a motion vector map I_motion, and to obtain the residuals, whose data form a residual map I_residual.
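The patent names only the open-source tool ffmpeg; as a non-authoritative sketch of this extraction step, the following assumes PyAV (Python bindings for FFmpeg) and the decoder's "+export_mvs" flag, which exposes motion vectors as frame side data. The library choice, flag, and field names are assumptions, not part of the patent:

```python
import av  # PyAV: Python bindings for FFmpeg (assumed installed)

def extract_compressed_domain_data(path):
    """Collect RGB key frames and per-frame motion vectors from a
    compressed stream without pixel-domain analysis of every frame."""
    container = av.open(path)
    stream = container.streams.video[0]
    # Ask the decoder to export motion vectors as side data on each frame.
    stream.codec_context.options = {"flags2": "+export_mvs"}

    key_frames, motion_vectors = [], []
    for frame in container.decode(stream):
        if frame.pict_type == "I":           # key frame: keep the full picture
            key_frames.append(frame.to_ndarray(format="rgb24"))
        else:                                # non-key frame: keep motion vectors
            mvs = frame.side_data.get("MOTION_VECTORS")
            motion_vectors.append(mvs.to_ndarray() if mvs is not None else None)
    # Residual maps are not exposed through this API; extracting them
    # generally requires a modified decoder, which is beyond this sketch.
    return key_frames, motion_vectors
```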
It can be understood that different key frames and non-key frames are obtained from the compressed video code stream file over time; as shown in fig. 2, the key frame I, and the motion vector and residual corresponding to each non-key frame P, are given at the different time instants.
It is understood that in inter coding, the relative displacement between the current coding block and the best matching block in its reference picture is represented by a Motion Vector (MV). Each divided block has corresponding motion information to be transmitted to the decoding end. If the MV of each block were coded and transmitted independently, a considerable number of bits would be consumed, especially for small block sizes. To code the MV of the current macroblock, H.264/AVC first uses the MVs of adjacent coded blocks to predict it, and then codes the difference (the Motion Vector Difference, MVD) between the predicted value (the Motion Vector Prediction, MVP) and the actually estimated MV, effectively reducing the number of bits used to code the MV.
For example, to estimate a motion vector for each 4 × 4 sub-block in an image, the median of all motion vectors in region A1 is first calculated, where each 4 × 4 sub-block in region A1 has one motion vector extracted from the video stream during decoding. In total 16 motion vectors are involved, denoted mv_i, i = 0, 1, …, 15, so the motion vector results for the sub-blocks are mv_0 to mv_15. Then, according to the current macroblock type, for example P16 × 16, which includes an initial motion vector mv16 × 16 and a refinement step re16 × 16, the motion vector of the block is calculated.
It should be noted that a B picture (frame) is a coded picture, also called a bidirectionally predicted frame, which compresses the amount of data to be transmitted by exploiting the temporal redundancy between the coded frames preceding and following it in the source picture sequence.
The encoding process for P frames and B frames is as follows. Motion estimation is performed and the rate-distortion cost of each inter coding mode is calculated; P frames reference only preceding frames, while B frames may also reference following frames. Intra prediction is then performed, the intra mode with the minimum rate-distortion cost is compared with the best inter mode, and the coding mode to be adopted is determined; the difference between the actual value and the predicted value is calculated. This residual is transformed, quantized and then encoded, so a residual can be recovered from the encoded data. As shown in fig. 2, the residual corresponding to the P frame at each time instant is obtained after decoding.
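To make the residual concrete, the following toy numpy sketch (an illustration of the idea only, not the codec's actual pipeline; all names are hypothetical) shows how an inter-coded block's residual arises as the difference between the current block and its motion-compensated prediction:

```python
import numpy as np

def inter_residual(current_block, reference_frame, mv, top_left):
    """Residual of one inter-coded block: actual minus motion-compensated
    prediction. `mv` is the (dy, dx) displacement into the reference frame;
    the displaced block is assumed to stay inside the reference frame."""
    y, x = top_left
    dy, dx = mv
    h, w = current_block.shape
    prediction = reference_frame[y + dy : y + dy + h, x + dx : x + dx + w]
    # Signed arithmetic avoids uint8 wraparound when subtracting pixels.
    return current_block.astype(np.int16) - prediction.astype(np.int16)

# The encoder transforms, quantizes, and entropy-codes this residual,
# which is why residual data can be read back from the compressed stream.
```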
S102, obtaining accumulated residual data according to the residual.

Accordingly, the accumulated residual can be expressed as:

R_i^(t) = Δ_i^(t) + R^(t−1)_{η_i^(t,t−1)}

where R_i^(t) is the accumulated residual of the i-th pixel block in the t-th frame, Δ_i^(t) is the residual of the i-th pixel block in the t-th frame relative to the previous frame, η_i^(t,t−1) is the backtracking position of the pixel block in the previous frame (namely frame t−1), and R^(t−1)_{η_i^(t,t−1)} is the corresponding accumulated residual at that position. Applying this relation frame by frame yields the accumulated residual calculation formula.
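A minimal numpy sketch of this accumulation follows. It assumes the per-frame residual maps and the per-frame single-step backtracking index maps (the η_i^(t,t−1) of the formula above) have already been extracted; the variable names and data layout are illustrative assumptions:

```python
import numpy as np

def accumulate_residuals(residuals, backtrace):
    """residuals[t]: (H, W) residual map of non-key frame t (t = 1..T).
    backtrace[t]:  (H, W, 2) integer map giving, for each position i in
                   frame t, its backtracking position in frame t-1.
    Frame 0 is the key frame, whose accumulated residual is zero.
    Returns a dict of accumulated residual maps R^(t)."""
    H, W = residuals[1].shape
    accumulated = {0: np.zeros((H, W), dtype=np.float32)}
    for t in sorted(residuals):
        ys = backtrace[t][..., 0]
        xs = backtrace[t][..., 1]
        # R^(t)_i = Delta^(t)_i + R^(t-1) at the backtracked position of i
        accumulated[t] = residuals[t] + accumulated[t - 1][ys, xs]
    return accumulated
```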
S103, acquiring an accumulated motion vector according to the motion vector.

It can be understood that the obtained motion vector map is decoupled to obtain an accumulated motion vector, as follows.

For any P frame t, let τ^(t) represent the spatial displacement of the pixel blocks in the t-th frame. The reference position in the previous frame of a pixel block that appears at spatial position i in the t-th frame can then be expressed as:

η_i^(t,t−1) = i − τ_i^(t)

Further, let η_i^(t,p) denote the position in the p-th frame (p < t) of the pixel block at position i of the t-th frame; the position traced backward to the k-th frame (k ≤ t) can then be expressed recursively as:

η_i^(t,t) = i,  η_i^(t,k) = η_i^(t,k+1) − τ^(k+1)_{η_i^(t,k+1)}

where i is the position of the pixel block and η_i^(t,k) denotes the reference position obtained by tracing the pixel block back from the t-th frame to the k-th frame.

The accumulated motion vector map can then be represented as:

φ_i^(t,k) = i − η_i^(t,k)

where φ_i^(t,k), the accumulated motion vector of the pixel block at position i from the k-th frame to the t-th frame, is calculated by subtracting the backtracking position in the k-th frame from the current position i.
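The backtracking recursion and the accumulated motion vector map can be sketched as follows, assuming the motion vector maps τ^(p) are available as integer displacement maps at per-position granularity; the names and conventions are illustrative, not from the patent:

```python
import numpy as np

def accumulate_motion_vectors(motion_vectors, k, t):
    """motion_vectors[p]: (H, W, 2) integer map tau^(p); the entry at
    position i is the displacement of the pixel block at i in frame p.
    Returns (eta, phi): the positions traced back from frame t to frame k,
    and the accumulated motion vector map phi^(t,k) = i - eta^(t,k)."""
    H, W, _ = motion_vectors[t].shape
    grid = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"),
                    axis=-1)
    eta = grid.copy()                 # eta^(t,t) = i
    for p in range(t, k, -1):         # p = t, t-1, ..., k+1
        disp = motion_vectors[p][eta[..., 0], eta[..., 1]]
        # eta^(t,p-1) = eta^(t,p) - tau^(p) at eta^(t,p), kept in bounds
        eta = np.clip(eta - disp, 0, [H - 1, W - 1])
    phi = grid - eta                  # accumulated motion vector map
    return eta, phi
```

Because each map depends only on the already-extracted per-frame motion vectors, the maps for different frames t can be computed independently, which is what enables the parallel processing of non-key frames noted among the advantages above.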
It will be appreciated that, by processing each non-key frame in this way, an accumulated motion vector and an accumulated residual are obtained for every non-key frame; as shown in fig. 2, each non-key frame corresponds to a motion vector, a residual, an accumulated motion vector and accumulated residual data.
S104, taking the RGB frame data, the accumulated motion vector and the accumulated residual data as the input of a deep learning model, and obtaining a behavior feature vector of the deep learning model.
It should be noted that the deep learning model is trained in advance, and is used for training according to RGB frame data, the accumulated motion vectors, and the accumulated residual data, and obtaining a model corresponding to the behavior feature vectors.
The deep learning model in the embodiment of the present invention is a Convolutional Neural Network (CNN), a class of feedforward neural networks that contain convolution computations and have a deep structure, and one of the representative algorithms of deep learning.
Further, the step of using the RGB frame data, the accumulated motion vector, and the accumulated residual data as input of a deep learning model and obtaining a behavior feature vector of the deep learning model includes: acquiring an accumulated motion vector corresponding to each non-key frame and residual error data corresponding to each non-key frame; forming an input sequence by the RGB frame data, the accumulated motion vector corresponding to each non-key frame and the residual error data corresponding to each non-key frame; and taking the input sequence as the input of a deep learning model, and obtaining a behavior feature vector of the deep learning model.
It should be noted that, based on the above formulas, the accumulated motion vector φ and the accumulated residual data R corresponding to each non-key frame can be obtained. Assuming t frames in total, the resulting input image sequence is {I^(0), φ^(1), R^(1), …, φ^(t), R^(t)}, where I is the RGB frame data; this sequence is the input to the convolutional neural network.
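As a small illustration (names assumed, not from the patent), the interleaved input sequence can be assembled as:

```python
def build_input_sequence(rgb_key_frame, phi, R):
    """Interleave the key frame with each non-key frame's accumulated
    motion vector map and accumulated residual map, giving
    {I^(0), phi^(1), R^(1), ..., phi^(t), R^(t)}."""
    sequence = [rgb_key_frame]
    for t in sorted(phi):             # one (phi, R) pair per non-key frame
        sequence.append(phi[t])
        sequence.append(R[t])
    return sequence
```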
Further, the training process of the deep learning model comprises the following steps:
obtaining a test data set corresponding to multiple types of behaviors, wherein the test data set comprises: RGB frame data, accumulated motion vectors and accumulated residual data;
constructing an input layer: determining the number of input-layer neurons according to the test data set, the input layer receiving the test data set;
constructing a convolutional layer: determining the size and stride of the convolution kernel, where the kernel size is chosen according to the scale of the input data and the type of the data;
constructing a down-sampling layer: determining the pooling size, stride and pooling type;
constructing a fully connected layer;
specifically, the layers are connected in the order: input layer, convolutional layer, down-sampling layer, convolutional layer, fully connected layer; when the model accuracy is not less than a preset value, the current neural network is considered an available model. Each cube convolved by the 3D convolution kernels consists of 9 consecutive frames of the input image, with a patch of size 60 × 40 per frame.
After multiple layers of convolution and down-sampling, every 9 consecutive frames of the input image are converted into a 128-dimensional feature vector that captures the motion information of the input frames. The number of output-layer nodes equals the number of behavior classes, and each node is fully connected to the 128 nodes in C6. Finally, a linear classifier classifies the 128-dimensional feature vectors to achieve behavior recognition.
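A minimal PyTorch sketch of a network with this layer pattern (input, convolution, down-sampling, convolution, fully connected; 9-frame cubes of 60 × 40 patches; 128-dimensional output) is given below. The channel counts and kernel sizes are assumptions, since the description fixes only the layer order and the input and output dimensions:

```python
import torch
import torch.nn as nn

class BehaviorFeatureNet(nn.Module):
    """Input:  (N, C, 9, 60, 40) cubes of 9 consecutive 60x40 frame patches.
    Output: (N, 128) behavior feature vectors (the 'C6' features)."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=(3, 5, 5)),  # -> (16, 7, 56, 36)
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),                # -> (16, 7, 28, 18)
            nn.Conv3d(16, 32, kernel_size=(3, 5, 5)),           # -> (32, 5, 24, 14)
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(32 * 5 * 24 * 14, 128)              # 128-d feature

    def forward(self, x):
        x = self.features(x)
        return self.fc(torch.flatten(x, start_dim=1))

# Example: a batch of two 9-frame cubes -> two 128-d feature vectors
net = BehaviorFeatureNet()
features = net(torch.randn(2, 3, 9, 60, 40))   # torch.Size([2, 128])
```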
S105, inputting the behavior feature vector output by the deep learning model into an SVM classifier for behavior prediction.
It should be noted that the classification model is a Support Vector Machine (SVM), a generalized linear classifier that performs binary classification of data in a supervised learning manner; its decision boundary is the maximum-margin hyperplane solved from the learning samples. The SVM computes the empirical risk with the hinge loss function and adds a regularization term to the solving system to optimize the structural risk, making it a classifier with sparsity and robustness.
The embodiment of the invention adopts a linear SVM. Given input data and learning targets, the hard-margin SVM is an algorithm for solving the maximum-margin hyperplane of a linearly separable problem, under the constraint that the distance between each sample point and the decision boundary is greater than or equal to 1. The hard-margin SVM can be converted into an equivalent quadratic convex optimization problem to be solved, and the resulting decision boundary can classify any sample.
For behavior recognition and classification, a Support Vector Machine (SVM) classifier is adopted: using the kernel-function idea, the SVM maps nonlinear samples to a high-dimensional space so that they become linearly separable, and then finds the optimal separating hyperplane by maximizing the classification margin between the data sets.
The SVM classifier computes the optimal separating hyperplane, whose equation is:

w^T x + b = 0

where x is the input vector, w is the weight vector, and b is the bias term.

Each data point (x_i, y_i) in the sample space satisfies the following inequality:

y_i (w^T x_i + b) ≥ 1
the problem of calculating the optimal classification surface can be converted into a dual problem by adopting a Lagrange optimization method, and when the optimal classification surface is searched, a kernel function K (x) can be selectedi,xj) And solving the linear classification problem after nonlinear transformation. The kernel function is defined as follows:
Let X be the input space and H the feature space. If there exists a mapping from X to H,

φ(x): X → H

such that for all x_i, x_j ∈ X the function K(x_i, x_j) satisfies the condition

K(x_i, x_j) = φ(x_i) · φ(x_j)

then K(x_i, x_j) is a kernel function, φ(x) is the mapping function, and φ(x_i) · φ(x_j) is the inner product of φ(x_i) and φ(x_j).
The classification decision function is formulated as follows:

f(x) = sgn( Σ_{i=1}^{N} α_i* y_i K(x_i, x) + b* )

where α_i* are the Lagrange multipliers, b* is the classification threshold, K(x_i, x) is the inner-product (kernel) function, and x_i and y_i are the sample-space vectors and their labels. When f(x) > 0, the data point x belongs to the class; otherwise, the data point x does not belong to the class.
Commonly used inner-product kernel functions include the polynomial kernel, the radial basis function and the Sigmoid function; a Gaussian radial basis kernel is used in the present invention. The classification process of the SVM classifier is as follows:
inputting the behavior recognition sample data set into an SVM classifier for training; optimizing the parameters, and constructing a training model by using the parameters after obtaining the optimal parameters; inputting the behavior feature vector output by the convolutional neural network into an SVM classifier for behavior prediction; and acquiring a behavior prediction classification result and an identification rate.
S106, acquiring a behavior prediction classification result.
It should be noted that the categories of the behaviors include running, jumping, walking, and the like, and whether the current behavior is running, jumping, or walking can be obtained according to the classification result of the SVM, so the classification result of the SVM is used as the recognition result.
In addition, as shown in fig. 3, the present invention also discloses a behavior recognition device 300, wherein the device 300 comprises a processor 310 and a memory 320 connected with the processor 310 through a communication bus 330; wherein:
the memory 320 is used for storing a behavior recognition program;
the processor 310 is configured to execute the behavior recognition program to implement any one of the behavior recognition steps.
A computer storage medium is also disclosed, storing one or more programs executable by one or more processors 310 as shown in fig. 3, to cause the one or more processors 310 to perform any of the behavior recognition steps.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A method of behavior recognition, the method comprising:
acquiring key frame data and non-key frame data from a compressed video code stream file, wherein the non-key frame data comprises a motion vector and a residual, and the key frame data is RGB frame data;
obtaining accumulated residual data according to the residual;
obtaining an accumulated motion vector according to the motion vector;
taking the RGB frame data, the accumulated motion vector and the accumulated residual data as the input of a deep learning model to obtain a behavior characteristic vector of the deep learning model;
inputting the behavior feature vector output by the deep learning model into an SVM classifier for behavior prediction;
and obtaining a behavior prediction classification result.
2. The behavior recognition method according to claim 1, wherein the step of obtaining key frame data and non-key frame data in the compressed video code stream file comprises:
and decoding the compressed video code stream file by adopting a media file conversion tool to obtain key frame data, and extracting the motion vector and the residual error of the non-key frame.
3. The behavior recognition method according to claim 1 or 2, wherein said obtaining an accumulated motion vector from said motion vector comprises:

φ_i^(t,k) = i − η_i^(t,k)

or, alternatively,

φ_i^(t,k) = φ_i^(t,k+1) + τ^(k+1)_{η_i^(t,k+1)}

wherein:

η_i^(t,t) = i,  η_i^(t,k) = η_i^(t,k+1) − τ^(k+1)_{η_i^(t,k+1)}

wherein τ_j^(p) is the motion vector in the p-th frame of the pixel block at position j, p ≤ t; φ_i^(t,k) is the accumulated motion vector of the pixel block at position i of the t-th frame from the k-th frame to the t-th frame; and η_i^(t,k) is the reference position of the pixel block at position i of the t-th frame traced back from the t-th frame to the k-th frame.
4. The behavior recognition method according to claim 3, wherein the specific expression used to obtain the accumulated residual data from the residual is as follows:

R_i^(t) = Δ_i^(t) + R^(t−1)_{η_i^(t,t−1)}

wherein R_i^(t) is the accumulated residual of the i-th pixel block in the t-th frame, Δ_i^(t) is the residual of the i-th pixel block in the t-th frame, η_i^(t,t−1) is the backtracking position of the pixel block in frame t−1, and R^(t−1)_{η_i^(t,t−1)} is the corresponding accumulated residual.
5. The behavior recognition method according to claim 1, wherein the step of obtaining the behavior feature vector of the deep learning model by using the RGB frame data, the accumulated motion vector, and the accumulated residual data as the input of the deep learning model comprises:
acquiring an accumulated motion vector corresponding to each non-key frame and residual error data corresponding to each non-key frame;
forming an input sequence by the RGB frame data, the accumulated motion vector corresponding to each non-key frame and the residual error data corresponding to each non-key frame;
and taking the input sequence as the input of a deep learning model, and obtaining a behavior feature vector of the deep learning model.
6. The behavior recognition method according to any one of claims 1-2 and 4-5, wherein the step of classifying the feature vectors according to a classification model to obtain a classification result comprises:
and classifying the feature vectors according to a Support Vector Machine (SVM) to obtain a classification result.
7. The behavior recognition method according to claim 6, wherein the training process of the deep learning model includes:
obtaining a test data set corresponding to multiple types of behaviors, wherein the test data set comprises: RGB frame data, accumulated motion vectors and accumulated residual data;
constructing an input layer: determining the number of input-layer neurons according to the test data set, the input layer receiving the test data set;
constructing a convolutional layer: determining the size and stride of the convolution kernel, the kernel size being determined according to the scale of the input data and the type of the data;
constructing a down-sampling layer: determining the pooling size, stride and pooling type;
constructing a fully connected layer;
the layers being connected in the order: input layer, convolutional layer, down-sampling layer, convolutional layer, fully connected layer;
and when the model precision is not less than the preset value, determining the current neural network as an available model.
8. A behavior recognition apparatus, comprising a processor, and a memory connected to the processor via a communication bus; wherein the content of the first and second substances,
the memory is used for storing a behavior recognition program;
the processor configured to execute the behavior recognition program to implement the behavior recognition step according to any one of claims 1 to 7.
9. A computer storage medium, characterized in that the computer storage medium stores one or more programs executable by one or more processors to cause the one or more processors to perform the behavior recognition steps of any of claims 1 to 7.
CN201911215173.4A 2019-12-02 2019-12-02 Behavior recognition method and device and computer storage medium Pending CN111046766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911215173.4A CN111046766A (en) 2019-12-02 2019-12-02 Behavior recognition method and device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911215173.4A CN111046766A (en) 2019-12-02 2019-12-02 Behavior recognition method and device and computer storage medium

Publications (1)

Publication Number Publication Date
CN111046766A true CN111046766A (en) 2020-04-21

Family

ID=70234394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911215173.4A Pending CN111046766A (en) 2019-12-02 2019-12-02 Behavior recognition method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN111046766A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626178A (en) * 2020-05-24 2020-09-04 中南民族大学 Compressed domain video motion recognition method and system based on new spatio-temporal feature stream
CN112057830A (en) * 2020-09-10 2020-12-11 成都拟合未来科技有限公司 Training method, system, terminal and medium based on multi-dimensional motion capability recognition
CN112215908A (en) * 2020-10-12 2021-01-12 国家计算机网络与信息安全管理中心 Compressed domain-oriented video content comparison system, optimization method and comparison method
CN112637200A (en) * 2020-12-22 2021-04-09 武汉烽火众智数字技术有限责任公司 Loosely-coupled video target tracking implementation method
WO2022053080A3 (en) * 2020-09-10 2022-04-28 成都拟合未来科技有限公司 Training method and system based on multi-dimensional movement ability recognition, terminal, and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103618907A (en) * 2013-11-08 2014-03-05 天津大学 Multi-viewpoint distributed type video encoding and frame arranging device and method based on compressed sensing
CN105338357A (en) * 2015-09-29 2016-02-17 湖北工业大学 Distributed video compressed sensing coding technical method
CN109743575A (en) * 2018-12-05 2019-05-10 四川大学 A kind of DVC-HEVC video transcoding method based on naive Bayesian

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103618907A (en) * 2013-11-08 2014-03-05 天津大学 Multi-viewpoint distributed type video encoding and frame arranging device and method based on compressed sensing
CN105338357A (en) * 2015-09-29 2016-02-17 湖北工业大学 Distributed video compressed sensing coding technical method
CN109743575A (en) * 2018-12-05 2019-05-10 四川大学 A kind of DVC-HEVC video transcoding method based on naive Bayesian

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHAO-YUAN WU et al.: "Compressed Video Action Recognition", arXiv *
VADIM KANTOROV et al.: "Efficient feature extraction, encoding and classification for action recognition", IEEE *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626178A (en) * 2020-05-24 2020-09-04 中南民族大学 Compressed domain video motion recognition method and system based on new spatio-temporal feature stream
CN112057830A (en) * 2020-09-10 2020-12-11 成都拟合未来科技有限公司 Training method, system, terminal and medium based on multi-dimensional motion capability recognition
CN112057830B (en) * 2020-09-10 2021-07-27 成都拟合未来科技有限公司 Training method, system, terminal and medium based on multi-dimensional motion capability recognition
WO2022053080A3 (en) * 2020-09-10 2022-04-28 成都拟合未来科技有限公司 Training method and system based on multi-dimensional movement ability recognition, terminal, and medium
CN112215908A (en) * 2020-10-12 2021-01-12 国家计算机网络与信息安全管理中心 Compressed domain-oriented video content comparison system, optimization method and comparison method
CN112637200A (en) * 2020-12-22 2021-04-09 武汉烽火众智数字技术有限责任公司 Loosely-coupled video target tracking implementation method

Similar Documents

Publication Publication Date Title
CN111046766A (en) Behavior recognition method and device and computer storage medium
CN110796662B (en) Real-time semantic video segmentation method
CN113328755B (en) Compressed data transmission method facing edge calculation
CN113132727B (en) Scalable machine vision coding method and training method of motion-guided image generation network
CN112887712B (en) HEVC intra-frame CTU partitioning method based on convolutional neural network
CN114286093A (en) Rapid video coding method based on deep neural network
EP4173292A1 (en) Method and system for image compressing and coding with deep learning
KR20230046310A (en) Signaling of feature map data
Mital et al. Neural distributed image compression using common information
CN116962708A (en) Intelligent service cloud terminal data optimization transmission method and system
US6594375B1 (en) Image processing apparatus, image processing method, and storage medium
TW202337211A (en) Conditional image compression
CN116824694A (en) Action recognition system and method based on time sequence aggregation and gate control transducer
CN111310594A (en) Video semantic segmentation method based on residual error correction
CN113780129B (en) Action recognition method based on unsupervised graph sequence predictive coding and storage medium
Aliouat et al. An efficient low complexity region-of-interest detection for video coding in wireless visual surveillance
Chen et al. Point cloud compression with sibling context and surface priors
Ndubuaku et al. Edge-enhanced analytics via latent space dimensionality reduction
CN112399177A (en) Video coding method and device, computer equipment and storage medium
CN114501031B (en) Compression coding and decompression method and device
Antonio et al. Learning-based compression of visual objects for smart surveillance
CN113902000A (en) Model training, synthetic frame generation, video recognition method and device and medium
JP2022078735A (en) Image processing device, image processing program, image recognition device, image recognition program, and image recognition system
CN113556546A (en) Two-stage multi-hypothesis prediction video compressed sensing reconstruction method
Bondarchuk et al. Motion Vector Search Algorithm for Motion Compensation in Video Encoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200421