CN112464856A - Video streaming detection method based on human skeleton key points - Google Patents

Video streaming detection method based on human skeleton key points

Info

Publication number
CN112464856A
CN112464856A
Authority
CN
China
Prior art keywords
data
key points
skeleton
action
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011431363.2A
Other languages
Chinese (zh)
Other versions
CN112464856B (en)
Inventor
张洋
刘盾
唐学怡
沈余银
宋升
黄信云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chinamcloud Technology Co ltd
Original Assignee
Chengdu Chinamcloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chinamcloud Technology Co ltd filed Critical Chengdu Chinamcloud Technology Co ltd
Priority to CN202011431363.2A priority Critical patent/CN112464856B/en
Publication of CN112464856A publication Critical patent/CN112464856A/en
Application granted granted Critical
Publication of CN112464856B publication Critical patent/CN112464856B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video stream action detection method based on human skeleton key points. A sliding window of m seconds is used to capture m seconds of video at n frames per second. Human skeleton key points are identified in each of the m×n frames, and the top K skeletons are kept per frame. The inter-frame skeleton data are then split into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person. The method is aimed mainly at action detection and recognition in videos of indefinite length, and can run at 1× real-time speed on a 2080 Ti-class GPU, so that video action detection and recognition become practical.

Description

Video streaming detection method based on human skeleton key points
Technical Field
The invention relates to the field of video recognition, and in particular to a video stream action detection method based on human skeleton key points.
Background
Action detection is mainly based on a human body posture model and is used to recognize actions in pictures captured from video. For example, Chinese patent publication CN107194344A discloses a human behavior recognition method that adapts to the skeleton center, which mainly addresses the low recognition accuracy of the prior art. Its implementation steps are: 1) acquire a three-dimensional skeleton sequence from a skeleton sequence data set and preprocess it to obtain a coordinate matrix; 2) select characteristic parameters from the coordinate matrix, adaptively select a coordinate center, and re-normalize the action to obtain an action coordinate matrix; 3) denoise the action coordinate matrix with DTW, reduce its time-misalignment and noise problems with FTP, and classify it with an SVM. Compared with existing behavior recognition methods, that method effectively improves recognition accuracy and can be applied to surveillance, video games and human-computer interaction. However, it is aimed mainly at action recognition in short videos, its main application scenarios are access-control and security recognition systems, and its recognition performance on long videos is mediocre. The prior art works well for action classification of short videos, i.e. a short video is input and its action class is output; related techniques include C3D, ST-GCN and 2S-AGCN. Such methods are ineffective for action detection in long videos or video streams. Moreover, they have high hardware requirements, which makes practical deployment difficult.
Disclosure of Invention
The object of the invention is to overcome the defects of the prior art and provide a video stream action detection method based on human skeleton key points, aimed mainly at action detection and recognition in videos of indefinite length. Detection and recognition can reach 1× real-time speed on a 2080 Ti-class GPU, so that video action detection and recognition become practical.
The object of the invention is achieved by the following technical solution:
A video stream action detection method based on human skeleton key points comprises the following steps:
1) using a sliding window of m seconds, capture m seconds of video at n frames per second each time to obtain m×n frames;
2) identify human skeleton key points in each of the m×n frames and keep the top K skeletons per frame; "top K" means that when several people appear in one image, K skeletons are kept according to a rule such as the K with the highest confidence or the K with the largest area;
3) split the inter-frame skeleton data into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person;
4) feed each skeleton sequence into the deep learning network model to obtain its prediction result.
Further, step 3) also includes a bone data normalization method comprising:
11) scale the coordinate data to a height of 1080, with the width scaled proportionally;
12) translate all bone data so that the bone center is the origin, making the data independent of image resolution, and multiply the bone data by s0 = 1.0;
13) compute the displacement of each key point between the current frame and the previous frame (0 for the first frame) and multiply it by s1 = 4.0; s0 adjusts the distribution range of the spatial part of the normalized feature data, and s1 adjusts the distribution range of the motion part;
14) concatenate the skeleton key points and the displacement data to form the input data for training and prediction, which yields the corresponding training data.
Further, the bone center is the middle point of the two hips.
Furthermore, the skeleton data are normalized to between -0.5 and 0.5, which lies within the maximum-gradient range of the tanh activation function and benefits the training convergence of the deep learning network model. Based on a large amount of statistics, the normalized data are observed to be approximately distributed in the range [-0.5, 0.5].
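By way of illustration only, a minimal Python/NumPy sketch of steps 11)-14) follows; the function name, the (T, K, 2) array layout and the use of a single center key point index are assumptions made for illustration, while the values 1080, s0 and s1 follow the text above:

import numpy as np

def normalize_skeleton_sequence(coords, img_h, center_idx=8, s0=1.0, s1=4.0):
    """coords: (T, K, 2) pixel key points of one person; returns (T, K*4) feature rows."""
    # 11) scale so that the image height becomes 1080; the width scales by the same factor
    scaled = coords * (1080.0 / img_h)
    # 12) translate so the bone center (mid-point of the two hips, body key point 8 in the
    #     worked example below) becomes the origin, then divide by 1080 and apply s0
    spatial = (scaled - scaled[:, center_idx:center_idx + 1, :]) / 1080.0 * s0
    # 13) per-key-point displacement between consecutive frames (first frame = 0), times s1
    motion = np.zeros_like(scaled)
    motion[1:] = (scaled[1:] - scaled[:-1]) / 1080.0 * s1
    # 14) concatenate the spatial and motion parts into one feature vector per frame
    T, K, _ = coords.shape
    return np.concatenate([spatial.reshape(T, K * 2), motion.reshape(T, K * 2)], axis=1)

With K = 67 key points this yields a 268-dimensional vector per frame, consistent with the feature size stated in the detailed description.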
Further, the deep learning network model prediction method is: input the bone sequence [x0, x1, x2, …] into a bidirectional recurrent deep learning network model and predict the label of each frame. The output is, for example: [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and any non-o label marks an action frame; in this example t is jump, z is turn, b_ marks the start of an action and i_ its continuation.
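Purely as an illustration of how such a per-frame label sequence can be turned into detected action segments (the patent only gives the verbal example above), a small Python sketch follows; the function name decode_segments and the (action, start, end) representation are assumptions:

def decode_segments(labels):
    """Turn a per-frame label list into (action, start_frame, end_frame) segments."""
    segments, start, action = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("b_"):              # start of a new action
            if start is not None:
                segments.append((action, start, i - 1))
            start, action = i, lab[2:]
        elif lab == "o":                      # no action: close any open segment
            if start is not None:
                segments.append((action, start, i - 1))
                start, action = None, None
        # "i_*" frames simply continue the open segment
    if start is not None:
        segments.append((action, start, len(labels) - 1))
    return segments

# e.g. decode_segments(["o", "b_t", "i_t", "i_t", "o", "b_z", "i_z"]) -> [("t", 1, 3), ("z", 5, 6)]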
Further, the training data set is produced as follows:
111) extract frames from the video to be labelled at 10 frames per second;
112) extract 10 groups, ranked from high to low by image quality;
113) manually label one group of data, i.e. put each action sequence into the corresponding action directory; frame numbers must not be contiguous between two action sequences;
114) automatically group the remaining groups according to the manually labelled data;
115) extract skeleton key points frame by frame;
116) normalize the skeleton key point data as described above;
117) randomly combine the data into training sequences of 30-70 frames, each containing both action sequences and non-action sequences;
118) split the training data into a frame-number sequence and a label sequence stored in separate files; the feature data corresponding to the frame numbers are also stored in a separate feature file.
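The following Python sketch illustrates steps 117) and 118); the clip representation, file names and JSON format are assumptions made for illustration, and only the 30-70 length range and the split into frame-number and label files come from the text:

import json
import random

def build_training_sample(action_clips, background_clips, min_len=30, max_len=70):
    """Each clip is a list of (frame_no, label) pairs with labels already in b_/i_/o form."""
    target = random.randint(min_len, max_len)
    sample = []
    while len(sample) < target:
        # alternate no-action and action clips so action sequences stay separated
        sample.extend(random.choice(background_clips))
        sample.extend(random.choice(action_clips))
    sample = sample[:target]
    return [f for f, _ in sample], [l for _, l in sample]

def save_sample(frames, labels, idx, out_dir="train"):
    # frame-number sequence and label sequence go to separate (hypothetical) files
    with open(f"{out_dir}/{idx}.frames.json", "w") as fp:
        json.dump(frames, fp)
    with open(f"{out_dir}/{idx}.labels.json", "w") as fp:
        json.dump(labels, fp)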
Further, the single-stream model is described in detail as follows:
the input is normalized skeleton key point data, one item per frame; inputs of 1 to n frames are supported; the input shape is (batch_size, seq_len, flat_num);
a linear transformation followed by Tanh activation;
the data are fed into a multi-layer bidirectional LSTM deep learning network;
a CRF layer strengthens the label transition constraints of the sequence;
B_ represents the start of an action, I_ the continuation of an action, and O no action.
O may be followed by O or B_, but not I_;
B_ may be followed by I_, but not B_ or O;
I_ may be followed by I_, O or B_.
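For illustration, a minimal PyTorch sketch of this single-stream model (linear + Tanh, multi-layer bidirectional LSTM, CRF on top) might look as follows; it assumes the third-party pytorch-crf package for the CRF layer, and the hidden size, layer count and tag count are illustrative rather than values taken from the patent:

import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

class SingleStreamTagger(nn.Module):
    def __init__(self, flat_num=268, hidden=256, layers=2, num_tags=5):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(flat_num, hidden), nn.Tanh())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)   # per-frame tag scores
        self.crf = CRF(num_tags, batch_first=True)    # learns O/B_/I_ transition constraints

    def forward(self, x, tags=None, mask=None):
        # x: (batch_size, seq_len, flat_num) normalized skeleton features
        h, _ = self.lstm(self.embed(x))
        emissions = self.emit(h)
        if tags is not None:                           # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)   # inference: best tag path per frame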
The invention has the beneficial effects that the method can extract features from long videos with high recognition accuracy, and is suitable for feature extraction in long-duration streaming media playback.
Drawings
FIG. 1 is a normalized data distribution diagram (spatial feature part);
FIG. 2 is a normalized data distribution diagram (motion feature part);
FIG. 3 is a schematic diagram of the single-stream model;
FIG. 4 is a schematic diagram of the three-stream fusion model;
FIG. 5 is a schematic diagram of the linear transformation and tanh non-linear activation applied to each of the three streams.
Detailed Description
The technical solution of the present invention is further described in detail with reference to the following specific examples, but the scope of the present invention is not limited to them.
Using a sliding window of m seconds, m seconds of video are captured each time at n frames per second. Human skeleton key points are identified in each of the m×n frames, and the top K skeletons are kept per frame; "top K" means that when several people appear in one picture, K skeletons are kept according to a rule such as the K with the highest confidence or the K with the largest area. The inter-frame skeleton data are then split into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person.
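The patent does not spell out the matching procedure beyond the use of Euclidean distance; the following sketch assumes a simple greedy nearest-neighbour assignment of each frame's skeletons to existing per-person tracks, with the distance threshold max_dist chosen arbitrarily:

import numpy as np

def split_into_tracks(frames, max_dist=100.0):
    """frames: list over time, each a list of (K, 2) key point arrays; returns per-person tracks."""
    tracks = []                                   # each track: list of (K, 2) arrays
    for skeletons in frames:
        unmatched = list(range(len(skeletons)))
        for track in tracks:
            if not unmatched:
                break
            last = track[-1]
            # distance = mean key-point-to-key-point Euclidean distance to the track's last frame
            dists = [np.linalg.norm(skeletons[i] - last, axis=1).mean() for i in unmatched]
            j = int(np.argmin(dists))
            if dists[j] < max_dist:
                track.append(skeletons[unmatched.pop(j)])
        for i in unmatched:                       # skeletons with no close track start new tracks
            tracks.append([skeletons[i]])
    return tracks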
The original bone data are image coordinates, which are not convenient for training and prediction of the deep learning network model, so the bone data are normalized. The specific normalization method is as follows.
The coordinate data are scaled to a height of 1080, with the width scaled proportionally.
All bone data are translated so that the bone center (the mid-point of the two hips) is the origin, making the data independent of image resolution, and multiplied by s0 = 1.0. The data distribution is shown in Fig. 1.
The displacement of each key point between the current frame and the previous frame is computed (0 for the first frame) and multiplied by s1 = 4.0; the data distribution is shown in Fig. 2. Here s0 adjusts the distribution range of the spatial part of the normalized feature data, and s1 adjusts the distribution range of the motion part.
The skeleton key points and the displacement data are concatenated to form the input data for training and prediction. The data are normalized to between -0.5 and 0.5, which lies within the maximum-gradient range of the tanh activation function and benefits the training convergence of the deep learning network model.
For example:
1. Skeleton key points are extracted from the original image: 67 key points in total, comprising 25 body key points, 21 left-hand key points and 21 right-hand key points. Each key point consists of (abscissa, ordinate), and the coordinate origin is the upper-left corner of the image.
Input image example (resolution 544×960):
Output example s1out: a 134-dimensional array in which every two values form one key point location.
[315, 368, 302, 428, 263, 428, 242, 502, 242, 562, 342, 428, 397, 399, 386, 349, 302, 557, 271, 560, 260, 660, 250, 743, 326, 557, 336, 659, 342, 746, 305, 360, 323, 363, 286, 368, 331, 371, 352, 788, 365, 783, 328, 757, 252, 773, 239, 767, 255, 746, 382, 348, 375, 348, 365, 343, 357, 337, 352, 333, 365, 325, 358, 315, 353, 310, 350, 305, 372, 322, 366, 310, 364, 301, 362, 293, 378, 321, 374, 308, 372, 301, 371, 293, 385, 321, 386, 313, 387, 308, 388, 305, 256, 569, 244, 568, 241, 583, 240, 592, 240, 599, 243, 578, 242, 589, 240, 595, 241, 601, 244, 577, 241, 586, 242, 592, 242, 600, 245, 579, 240, 584, 242, 590, 242, 601, 244, 580, 241, 597, 240, 597, 241, 601]
2. The ordinate is scaled to 1080 and the abscissa is scaled by the same factor.
In this example:
y_scale = 1080/960 = 1.125, and all 67×2 = 134 values from the first step are multiplied by y_scale to yield s2out:
[354.375, 414.0, 339.75, 481.5, 295.875, 481.5, 272.25, 564.75, 272.25, 632.25, 384.75, 481.5, 446.625, 448.875, 434.25, 392.625, 339.75, 626.625, 304.875, 630.0, 292.5, 742.5, 281.25, 835.875, 366.75, 626.625, 378.0, 741.375, 384.75, 839.25, 343.125, 405.0, 363.375, 408.375, 321.75, 414.0, 372.375, 417.375, 396.0, 886.5, 410.625, 880.875, 369.0, 851.625, 283.5, 869.625, 268.875, 862.875, 286.875, 839.25, 429.75, 391.5, 421.875, 391.5, 410.625, 385.875, 401.625, 379.125, 396.0, 374.625, 410.625, 365.625, 402.75, 354.375, 397.125, 348.75, 393.75, 343.125, 418.5, 362.25, 411.75, 348.75, 409.5, 338.625, 407.25, 329.625, 425.25, 361.125, 420.75, 346.5, 418.5, 338.625, 417.375, 329.625, 433.125, 361.125, 434.25, 352.125, 435.375, 346.5, 436.5, 343.125, 288.0, 640.125, 274.5, 639.0, 271.125, 655.875, 270.0, 666.0, 270.0, 673.875, 273.375, 650.25, 272.25, 662.625, 270.0, 669.375, 271.125, 676.125, 274.5, 649.125, 271.125, 659.25, 272.25, 666.0, 272.25, 675.0, 275.625, 651.375, 270.0, 657.0, 272.25, 663.75, 272.25, 676.125, 274.5, 652.5, 271.125, 671.625, 270.0, 671.625, 271.125, 676.125]
3. The coordinates are normalized relative to the center point of the human body.
Referring to the body key point layout, the 8th body key point is taken as the center point, i.e. the two values (s2out[16], s2out[17]) of the second-step output. The relative positions of the 67 key points are then calculated: s2out[16] is subtracted from all abscissas and s2out[17] from all ordinates.
Taking the first point (354.375, 414.0) as an example, after transformation:
(354.375, 414.0) - (s2out[16], s2out[17]) = (354.375, 414.0) - (339.75, 626.625) = (354.375 - 339.75, 414.0 - 626.625) = (14.625, -212.625). The relative coordinates are then divided by 1080 to obtain (0.01354, -0.19687) and multiplied by s0 = 1.0 to adjust the distribution range of the output values; the default value 1.0 is equivalent to no adjustment.
All 67 key points are transformed in this way to obtain s3out:
[0.01354, -0.19687, 0.0, -0.13437, -0.04063, -0.13437, -0.0625, -0.05729, -0.0625, 0.00521, 0.04167, -0.13437, 0.09896, -0.16458, 0.0875, -0.21667, 0.0, 0.0, -0.03229, 0.00313, -0.04375, 0.10729, -0.05417, 0.19375, 0.025, 0.0, 0.03542, 0.10625, 0.04167, 0.19687, 0.00313, -0.20521, 0.02187, -0.20208, -0.01667, -0.19687, 0.03021, -0.19375, 0.05208, 0.24063, 0.06563, 0.23542, 0.02708, 0.20833, -0.05208, 0.225, -0.06563, 0.21875, -0.04896, 0.19687, 0.08333, -0.21771, 0.07604, -0.21771, 0.06563, -0.22292, 0.05729, -0.22917, 0.05208, -0.23333, 0.06563, -0.24167, 0.05833, -0.25208, 0.05312, -0.25729, 0.05, -0.2625, 0.07292, -0.24479, 0.06667, -0.25729, 0.06458, -0.26667, 0.0625, -0.275, 0.07917, -0.24583, 0.075, -0.25938, 0.07292, -0.26667, 0.07187, -0.275, 0.08646, -0.24583, 0.0875, -0.25417, 0.08854, -0.25938, 0.08958, -0.2625, -0.04792, 0.0125, -0.06042, 0.01146, -0.06354, 0.02708, -0.06458, 0.03646, -0.06458, 0.04375, -0.06146, 0.02187, -0.0625, 0.03333, -0.06458, 0.03958, -0.06354, 0.04583, -0.06042, 0.02083, -0.06354, 0.03021, -0.0625, 0.03646, -0.0625, 0.04479, -0.05937, 0.02292, -0.06458, 0.02813, -0.0625, 0.03438, -0.0625, 0.04583, -0.06042, 0.02396, -0.06354, 0.04167, -0.06458, 0.04167, -0.06354, 0.04583]
A large amount of data was accumulated to obtain a distribution graph in which the abscissa is the normalized value and the ordinate is the count of each value, in units of millions (1e6). Observing Fig. 1, most values fall within the interval (-0.5, 0.5), i.e. the data are roughly normalized to (-0.5, 0.5). If the data need to be normalized to (-1, 1) instead, it suffices to set the parameter s0 = 2.0.
4. Steps 1-3 yield the first half of the final feature data, namely the spatial-position feature part. The motion feature information still has to be obtained, and the process is relatively simple.
Assume two adjacent frames f0 and f1 are transformed by steps 1 and 2 to obtain s2out0 and s2out1; the motion data are then (s2out1 - s2out0)/1080 × s1. The parameter s1 is introduced to adjust the value range of the motion feature part so that it is close to that of the spatial feature part, which benefits the subsequent training of the deep learning network model. With s1 = 4.0, the distribution of the motion feature data is as shown in Fig. 2.
A large amount of motion feature data was accumulated to obtain a distribution graph in which the abscissa is the normalized value and the ordinate is the count of each value, in units of millions (1e6).
5. The spatial features and the motion features are concatenated to form a 268-dimensional feature vector.
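As a quick numeric check of steps 2-3 for the first key point (315, 368) of the 544×960 example image (all values taken from s1out/s2out/s3out above):

y_scale = 1080 / 960                     # = 1.125
x, y = 315 * y_scale, 368 * y_scale      # = (354.375, 414.0)
cx, cy = 302 * y_scale, 557 * y_scale    # body key point 8 (the center) -> (339.75, 626.625)
s0 = 1.0
rel = ((x - cx) / 1080 * s0, (y - cy) / 1080 * s0)
print(rel)                               # ~ (0.01354, -0.19687), matching s3out[0:2]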
The bone sequence [x0, x1, x2, …] is input into the bidirectional recurrent deep learning network model, and the label of each frame is predicted. The output is, for example: [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and any non-o label marks an action frame. In this example, t is jump, z is turn, b_ marks the start of an action and i_ its continuation.
The bone data were normalized as described above.
A bidirectional recurrent deep learning network (Bi-LSTM) combined with a conditional random field (CRF) is adopted.
The training data are randomly composed over the whole sequence, and no-action sequences must be included between action sequences.
Production of the training data set (the training video data may contain only a single person):
Frames are extracted from the video to be labelled at 10 frames per second.
10 groups are extracted, ranked from high to low by image quality.
One group of data is labelled manually, i.e. each action sequence is put into the corresponding action directory. Frame numbers must not be contiguous between two action sequences.
The remaining groups are grouped automatically according to the manually labelled data.
Skeleton key points are extracted frame by frame.
The skeleton key point data are normalized in the manner described above.
The data are randomly combined into training sequences of 30-70 frames, containing both action sequences and non-action sequences.
The training data are split into a frame-number sequence and a label sequence stored in separate files. The feature data corresponding to the frame numbers are also stored in a separate feature file.
Deep learning network model: an end-to-end single-stream model is used in this embodiment; its principle is illustrated in Fig. 3.
Three-stream fusion model, as shown in Fig. 4:
Stream 1: bone data, i.e. the lengths between connected skeleton key points, for example wrist-elbow and elbow-shoulder; these data can be generated from the spatial features.
Stream 2: joint data, i.e. the spatial part of the normalized feature data described above.
Stream 3: motion data.
Each of the three streams is passed through its own linear transformation and tanh non-linear activation; the principle is shown in Fig. 5.
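A hedged PyTorch sketch of this three-stream front end is given below; the per-stream linear + Tanh follows Fig. 5, while the concatenation-based fusion, the hidden size and the bone_lengths helper are assumptions for illustration:

import torch
import torch.nn as nn

class ThreeStreamEncoder(nn.Module):
    def __init__(self, bone_dim, joint_dim, motion_dim, hidden=128):
        super().__init__()
        self.bone = nn.Sequential(nn.Linear(bone_dim, hidden), nn.Tanh())
        self.joint = nn.Sequential(nn.Linear(joint_dim, hidden), nn.Tanh())
        self.motion = nn.Sequential(nn.Linear(motion_dim, hidden), nn.Tanh())

    def forward(self, bone, joint, motion):
        # each input: (batch_size, seq_len, *_dim); the fused output feeds the Bi-LSTM + CRF
        return torch.cat([self.bone(bone), self.joint(joint), self.motion(motion)], dim=-1)

def bone_lengths(joints, pairs):
    """joints: (T, K, 2) centered key points; pairs: list of (i, j) connected key point indices."""
    return torch.stack([(joints[:, i] - joints[:, j]).norm(dim=-1) for i, j in pairs], dim=-1)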
Referring to Fig. 3, the single-stream model is described in detail.
The input data are normalized skeleton key point data, one item per frame; inputs of 1 to n frames are supported. The input shape is (batch_size, seq_len, flat_num).
A linear transformation and Tanh activation are applied.
The data are fed into a multi-layer bidirectional LSTM deep learning network.
A CRF layer strengthens the label transition constraints of the sequence.
An action sequence must start with B_.
B_ cannot be followed by O.
B_ is followed by I_, the continuation of the action.
Video stream detection and recognition:
A sliding window of m seconds with a step of s seconds is slid over the video in temporal order.
x frames of images are extracted from each window.
Skeleton key points are extracted for the people in each frame.
The skeleton key points are combined into several skeleton sequences according to Euclidean distance.
Each skeleton sequence is fed into the deep learning network model to obtain its prediction result.
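Putting the pieces together, the stream-detection loop might be sketched as follows; it reuses the hypothetical helpers sketched earlier (normalize_skeleton_sequence, split_into_tracks, decode_segments, SingleStreamTagger), while extract_keypoints stands in for an external pose estimator, ID2TAG is an assumed tag-id-to-label mapping, and m, s, n are example window parameters:

import numpy as np
import torch

ID2TAG = {0: "o", 1: "b_t", 2: "i_t", 3: "b_z", 4: "i_z"}   # illustrative mapping only

def detect_stream(video_frames, fps, model, m=4, s=1, n=10):
    """Slide an m-second window with step s seconds over the frames, keeping n frames/second."""
    results = []
    window, step = int(m * fps), int(s * fps)
    stride = max(1, int(fps / n))                       # subsample to n frames per second
    for start in range(0, len(video_frames) - window + 1, step):
        frames = video_frames[start:start + window:stride]
        keypoints = [extract_keypoints(f) for f in frames]      # top-K skeletons per frame
        for track in split_into_tracks(keypoints):              # one sequence per person
            feats = normalize_skeleton_sequence(np.stack(track), img_h=frames[0].shape[0])
            x = torch.tensor(feats[None], dtype=torch.float32)  # (1, seq_len, flat_num)
            labels = [ID2TAG[t] for t in model(x)[0]]           # CRF decode -> per-frame labels
            results.append((start, decode_segments(labels)))    # per-window action segments
    return results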
The foregoing is illustrative of the preferred embodiments of this invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A video stream action detection method based on human skeleton key points, characterized by comprising the following steps:
1) using a sliding window of m seconds, capturing m seconds of video at n frames per second each time to obtain m×n frames;
2) identifying human skeleton key points in each of the m×n frames and taking the top K skeletons in each frame;
3) splitting the inter-frame skeleton data into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person;
4) feeding each skeleton sequence into the deep learning network model to obtain its prediction result.
2. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein said step 3) further comprises a bone data normalization method comprising:
11) scaling the coordinate data to a height of 1080, with the width scaled proportionally;
12) translating all bone data so that the bone center is the origin, making the bone data independent of the image resolution, and multiplying the bone data by s0;
13) computing the displacement of each key point between the current frame and the previous frame (0 for the first frame) and multiplying it by s1, wherein s0 adjusts the distribution range of the spatial part of the normalized feature data and s1 adjusts the distribution range of the motion part;
14) concatenating the skeleton key points and the displacement data to form the input data for training and prediction, which yields the corresponding training data.
3. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the bone center is the mid-point of the two hips.
4. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the skeleton data are normalized to between -0.5 and 0.5, which lies within the maximum-gradient range of the tanh activation function and benefits the training convergence of the deep learning network model.
5. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the deep learning network model prediction method is: inputting the bone sequence [x0, x1, x2, …] into a bidirectional recurrent deep learning network model and predicting the label of each frame; the output is, for example: [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and any non-o label marks an action frame; in this example t is jump, z is turn, b_ marks the start of an action and i_ its continuation.
6. The video stream action detection method based on human skeleton key points as claimed in claim 1, further comprising a training data set production method comprising:
111) extracting frames from the video to be labelled at 10 frames per second;
112) extracting 10 groups, ranked from high to low by image quality;
113) manually labelling one group of data, i.e. putting each action sequence into the corresponding action directory; frame numbers must not be contiguous between two action sequences;
114) automatically grouping the remaining groups according to the manually labelled data;
115) extracting skeleton key points frame by frame;
116) normalizing the skeleton key point data as described above;
117) randomly combining the data into training sequences of 30-70 frames, each containing both action sequences and non-action sequences;
118) splitting the training data into a frame-number sequence and a label sequence stored in separate files;
the feature data corresponding to the frame numbers are also stored in a separate feature file.
7. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the deep learning network model is a single-stream model described as follows:
the input is normalized skeleton key point data, one item per frame; inputs of 1 to n frames are supported; the input shape is (batch_size, seq_len, flat_num);
a linear transformation followed by Tanh activation;
the data are fed into a multi-layer bidirectional LSTM deep learning network;
a CRF layer strengthens the label transition constraints of the sequence;
B_ represents the start of an action, I_ the continuation of an action, and O no action;
O may be followed by O or B_, but not I_;
B_ may be followed by I_, but not B_ or O;
I_ may be followed by I_, O or B_.
CN202011431363.2A 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones Active CN112464856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011431363.2A CN112464856B (en) 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011431363.2A CN112464856B (en) 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones

Publications (2)

Publication Number Publication Date
CN112464856A true CN112464856A (en) 2021-03-09
CN112464856B CN112464856B (en) 2023-06-13

Family

ID=74801107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011431363.2A Active CN112464856B (en) 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones

Country Status (1)

Country Link
CN (1) CN112464856B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710802A (en) * 2018-12-20 2019-05-03 百度在线网络技术(北京)有限公司 Video classification methods and its device
CN110263666A (en) * 2019-05-29 2019-09-20 西安交通大学 A kind of motion detection method based on asymmetric multithread
CN110348321A (en) * 2019-06-18 2019-10-18 杭州电子科技大学 Human motion recognition method based on bone space-time characteristic and long memory network in short-term
US20190332866A1 (en) * 2018-04-26 2019-10-31 Fyusion, Inc. Method and apparatus for 3-d auto tagging
CN110991274A (en) * 2019-11-18 2020-04-10 杭州电子科技大学 Pedestrian tumbling detection method based on Gaussian mixture model and neural network
CN111383421A (en) * 2018-12-30 2020-07-07 奥瞳***科技有限公司 Privacy protection fall detection method and system
CN111680562A (en) * 2020-05-09 2020-09-18 北京中广上洋科技股份有限公司 Human body posture identification method and device based on skeleton key points, storage medium and terminal
CN111680613A (en) * 2020-06-03 2020-09-18 安徽大学 Method for detecting falling behavior of escalator passengers in real time

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190332866A1 (en) * 2018-04-26 2019-10-31 Fyusion, Inc. Method and apparatus for 3-d auto tagging
CN109710802A (en) * 2018-12-20 2019-05-03 百度在线网络技术(北京)有限公司 Video classification methods and its device
CN111383421A (en) * 2018-12-30 2020-07-07 奥瞳***科技有限公司 Privacy protection fall detection method and system
CN110263666A (en) * 2019-05-29 2019-09-20 西安交通大学 A kind of motion detection method based on asymmetric multithread
CN110348321A (en) * 2019-06-18 2019-10-18 杭州电子科技大学 Human motion recognition method based on bone space-time characteristic and long memory network in short-term
CN110991274A (en) * 2019-11-18 2020-04-10 杭州电子科技大学 Pedestrian tumbling detection method based on Gaussian mixture model and neural network
CN111680562A (en) * 2020-05-09 2020-09-18 北京中广上洋科技股份有限公司 Human body posture identification method and device based on skeleton key points, storage medium and terminal
CN111680613A (en) * 2020-06-03 2020-09-18 安徽大学 Method for detecting falling behavior of escalator passengers in real time

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ROMERO MORAIS et al.: "Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11988-11996 *
ZHEN QIN et al.: "Learning Local Part Motion Representation for Skeleton-based Action Recognition", Proceedings of the 2019 11th International Conference on Education Technology and Computers, pages 295-299 *
时俊: "Research and Implementation of a GCN-Based Human Behavior Recognition ***" (基于GCN人体行为识别***的研究与实现), China Masters' Theses Full-text Database, Information Science and Technology, no. 1, pages 138-1346 *
许政: "Human Skeleton Keypoint Detection Based on Deep Learning" (基于深度学习的人体骨架点检测), China Masters' Theses Full-text Database, Information Science and Technology, no. 1, pages 138-1603 *

Also Published As

Publication number Publication date
CN112464856B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
CN110287805B (en) Micro-expression identification method and system based on three-stream convolutional neural network
Cho et al. Self-attention network for skeleton-based human action recognition
CN108537743B (en) Face image enhancement method based on generation countermeasure network
WO2020155873A1 (en) Deep apparent features and adaptive aggregation network-based multi-face tracking method
Jiang et al. Action unit detection using sparse appearance descriptors in space-time video volumes
Levy et al. Live repetition counting
CN110084130B (en) Face screening method, device, equipment and storage medium based on multi-target tracking
CN112418095A (en) Facial expression recognition method and system combined with attention mechanism
Barnich et al. Frontal-view gait recognition by intra-and inter-frame rectangle size distribution
Tang et al. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition
CN111062314B (en) Image selection method and device, computer readable storage medium and electronic equipment
CN115880784A (en) Scenic spot multi-person action behavior monitoring method based on artificial intelligence
Lertniphonphan et al. Human action recognition using direction histograms of optical flow
CN112149557B (en) Person identity tracking method and system based on face recognition
TW200910221A (en) Method of determining motion-related features and method of performing motion classification
Flórez et al. Hand gesture recognition following the dynamics of a topology-preserving network
JP2014116716A (en) Tracking device
Cui et al. Pose-appearance relational modeling for video action recognition
CN112464856A (en) Video streaming detection method based on human skeleton key points
Zheng et al. A normalized light CNN for face recognition
Li et al. A novel motion based lip feature extraction for lip-reading
Torpey et al. Human action recognition using local two-stream convolution neural network features and support vector machines
Deotale et al. Optimized hybrid RNN model for human activity recognition in untrimmed video
He et al. MTRFN: Multiscale temporal receptive field network for compressed video action recognition at edge servers
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant