CN112464856A - Video streaming detection method based on human skeleton key points - Google Patents

Video streaming detection method based on human skeleton key points

Info

Publication number
CN112464856A
CN112464856A
Authority
CN
China
Prior art keywords
data
key points
skeleton
action
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011431363.2A
Other languages
Chinese (zh)
Other versions
CN112464856B (en)
Inventor
张洋
刘盾
唐学怡
沈余银
宋升
黄信云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Chinamcloud Technology Co ltd
Original Assignee
Chengdu Chinamcloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Chinamcloud Technology Co ltd filed Critical Chengdu Chinamcloud Technology Co ltd
Priority to CN202011431363.2A priority Critical patent/CN112464856B/en
Publication of CN112464856A publication Critical patent/CN112464856A/en
Application granted granted Critical
Publication of CN112464856B publication Critical patent/CN112464856B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video stream action detection method based on human skeleton key points. A sliding window of m seconds is used to capture m seconds of video at n frames per second. Human skeleton key points are identified in each of the m×n frames, and the top K skeletons are kept per frame. The inter-frame skeleton data are then split into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person. The method is aimed mainly at action detection and recognition in videos of indefinite length, and can run at 1× real-time speed on a 2080 Ti-class GPU, so that video action detection and recognition become practical.

Description

Video streaming detection method based on human skeleton key points
Technical Field
The invention relates to the field of video recognition, and in particular to a video stream action detection method based on human skeleton key points.
Background
Action detection is mainly based on a human body posture model and is used to recognize actions in pictures captured from video. For example, Chinese patent publication CN107194344A discloses a human behavior recognition method that adapts to the skeleton center, which mainly addresses the low recognition accuracy of the prior art. Its implementation steps are: 1) acquire a three-dimensional skeleton sequence from a skeleton sequence data set and preprocess it to obtain a coordinate matrix; 2) select characteristic parameters from the coordinate matrix, adaptively select a coordinate center, and re-normalize the action to obtain an action coordinate matrix; 3) denoise the action coordinate matrix with DTW, reduce its time-misalignment and noise problems with FTP, and classify it with an SVM. Compared with existing behavior recognition methods, that method effectively improves recognition accuracy and can be applied to surveillance, video games and human-computer interaction. However, it is aimed mainly at action recognition in short videos, its main application scenarios are access-control and security recognition systems, and its recognition performance on long videos is mediocre. The prior art works well for action classification of short videos, i.e. a short video is input and its action class is output; related techniques include C3D, ST-GCN and 2S-AGCN. Such methods are ineffective for action detection in long videos or video streams. Moreover, they have high hardware requirements, which makes practical deployment difficult.
Disclosure of Invention
The object of the invention is to overcome the defects of the prior art and provide a video stream action detection method based on human skeleton key points, aimed mainly at action detection and recognition in videos of indefinite length. Detection and recognition can reach 1× real-time speed on a 2080 Ti-class GPU, so that video action detection and recognition become practical.
The object of the invention is achieved by the following technical solution:
A video stream action detection method based on human skeleton key points comprises the following steps:
1) using a sliding window of m seconds, capture m seconds of video at n frames per second each time to obtain m×n frames;
2) identify human skeleton key points in each of the m×n frames and keep the top K skeletons per frame; "top K" means that when several people appear in one image, K skeletons are kept according to a rule such as the K with the highest confidence or the K with the largest area;
3) split the inter-frame skeleton data into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person;
4) feed each skeleton sequence into the deep learning network model to obtain its prediction result.
Further, step 3) also includes a bone data normalization method comprising:
11) scale the coordinate data to a height of 1080, with the width scaled proportionally;
12) translate all bone data so that the bone center is the origin, making the data independent of image resolution, and multiply the bone data by s0 = 1.0;
13) compute the displacement of each key point between the current frame and the previous frame (0 for the first frame) and multiply it by s1 = 4.0; s0 adjusts the distribution range of the spatial part of the normalized feature data, and s1 adjusts the distribution range of the motion part;
14) concatenate the skeleton key points and the displacement data to form the input data for training and prediction, which yields the corresponding training data.
Further, the bone center is the middle point of the two hips.
Furthermore, the skeleton data are normalized to between -0.5 and 0.5, which lies within the maximum-gradient range of the tanh activation function and benefits the training convergence of the deep learning network model. Based on a large amount of statistics, the normalized data are observed to be approximately distributed in the range [-0.5, 0.5].
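By way of illustration only, a minimal Python/NumPy sketch of steps 11)-14) follows; the function name, the (T, K, 2) array layout and the use of a single center key point index are assumptions made for illustration, while the values 1080, s0 and s1 follow the text above:

import numpy as np

def normalize_skeleton_sequence(coords, img_h, center_idx=8, s0=1.0, s1=4.0):
    """coords: (T, K, 2) pixel key points of one person; returns (T, K*4) feature rows."""
    # 11) scale so that the image height becomes 1080; the width scales by the same factor
    scaled = coords * (1080.0 / img_h)
    # 12) translate so the bone center (mid-point of the two hips, body key point 8 in the
    #     worked example below) becomes the origin, then divide by 1080 and apply s0
    spatial = (scaled - scaled[:, center_idx:center_idx + 1, :]) / 1080.0 * s0
    # 13) per-key-point displacement between consecutive frames (first frame = 0), times s1
    motion = np.zeros_like(scaled)
    motion[1:] = (scaled[1:] - scaled[:-1]) / 1080.0 * s1
    # 14) concatenate the spatial and motion parts into one feature vector per frame
    T, K, _ = coords.shape
    return np.concatenate([spatial.reshape(T, K * 2), motion.reshape(T, K * 2)], axis=1)

With K = 67 key points this yields a 268-dimensional vector per frame, consistent with the feature size stated in the detailed description.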
Further, the deep learning network model prediction method is: input the bone sequence [x0, x1, x2, …] into a bidirectional recurrent deep learning network model and predict the label of each frame. The output is, for example: [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and any non-o label marks an action frame; in this example t is jump, z is turn, b_ marks the start of an action and i_ its continuation.
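Purely as an illustration of how such a per-frame label sequence can be turned into detected action segments (the patent only gives the verbal example above), a small Python sketch follows; the function name decode_segments and the (action, start, end) representation are assumptions:

def decode_segments(labels):
    """Turn a per-frame label list into (action, start_frame, end_frame) segments."""
    segments, start, action = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("b_"):              # start of a new action
            if start is not None:
                segments.append((action, start, i - 1))
            start, action = i, lab[2:]
        elif lab == "o":                      # no action: close any open segment
            if start is not None:
                segments.append((action, start, i - 1))
                start, action = None, None
        # "i_*" frames simply continue the open segment
    if start is not None:
        segments.append((action, start, len(labels) - 1))
    return segments

# e.g. decode_segments(["o", "b_t", "i_t", "i_t", "o", "b_z", "i_z"]) -> [("t", 1, 3), ("z", 5, 6)]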
Further, the training data set is produced as follows:
111) extract frames from the video to be labelled at 10 frames per second;
112) extract 10 groups, ranked from high to low by image quality;
113) manually label one group of data, i.e. put each action sequence into the corresponding action directory; frame numbers must not be contiguous between two action sequences;
114) automatically group the remaining groups according to the manually labelled data;
115) extract skeleton key points frame by frame;
116) normalize the skeleton key point data as described above;
117) randomly combine the data into training sequences of 30-70 frames, each containing both action sequences and non-action sequences;
118) split the training data into a frame-number sequence and a label sequence stored in separate files; the feature data corresponding to the frame numbers are also stored in a separate feature file.
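The following Python sketch illustrates steps 117) and 118); the clip representation, file names and JSON format are assumptions made for illustration, and only the 30-70 length range and the split into frame-number and label files come from the text:

import json
import random

def build_training_sample(action_clips, background_clips, min_len=30, max_len=70):
    """Each clip is a list of (frame_no, label) pairs with labels already in b_/i_/o form."""
    target = random.randint(min_len, max_len)
    sample = []
    while len(sample) < target:
        # alternate no-action and action clips so action sequences stay separated
        sample.extend(random.choice(background_clips))
        sample.extend(random.choice(action_clips))
    sample = sample[:target]
    return [f for f, _ in sample], [l for _, l in sample]

def save_sample(frames, labels, idx, out_dir="train"):
    # frame-number sequence and label sequence go to separate (hypothetical) files
    with open(f"{out_dir}/{idx}.frames.json", "w") as fp:
        json.dump(frames, fp)
    with open(f"{out_dir}/{idx}.labels.json", "w") as fp:
        json.dump(labels, fp)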
Further, the single-stream model is described in detail as follows:
the input is normalized skeleton key point data, one item per frame; inputs of 1 to n frames are supported; the input shape is (batch_size, seq_len, flat_num);
a linear transformation followed by Tanh activation;
the data are fed into a multi-layer bidirectional LSTM deep learning network;
a CRF layer strengthens the label transition constraints of the sequence;
B_ represents the start of an action, I_ the continuation of an action, and O no action.
O may be followed by O or B_, but not I_;
B_ may be followed by I_, but not B_ or O;
I_ may be followed by I_, O or B_.
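For illustration, a minimal PyTorch sketch of this single-stream model (linear + Tanh, multi-layer bidirectional LSTM, CRF on top) might look as follows; it assumes the third-party pytorch-crf package for the CRF layer, and the hidden size, layer count and tag count are illustrative rather than values taken from the patent:

import torch
import torch.nn as nn
from torchcrf import CRF  # third-party package: pip install pytorch-crf

class SingleStreamTagger(nn.Module):
    def __init__(self, flat_num=268, hidden=256, layers=2, num_tags=5):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(flat_num, hidden), nn.Tanh())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)   # per-frame tag scores
        self.crf = CRF(num_tags, batch_first=True)    # learns O/B_/I_ transition constraints

    def forward(self, x, tags=None, mask=None):
        # x: (batch_size, seq_len, flat_num) normalized skeleton features
        h, _ = self.lstm(self.embed(x))
        emissions = self.emit(h)
        if tags is not None:                           # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)   # inference: best tag path per frame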
The invention has the beneficial effects that the method can extract features from long videos with high recognition accuracy, and is suitable for feature extraction in long-duration streaming media playback.
Drawings
FIG. 1 is a normalized data distribution diagram (spatial feature part);
FIG. 2 is a normalized data distribution diagram (motion feature part);
FIG. 3 is a schematic diagram of the single-stream model;
FIG. 4 is a schematic diagram of the three-stream fusion model;
FIG. 5 is a schematic diagram of the linear transformation and tanh non-linear activation applied to each of the three streams.
Detailed Description
The technical solution of the present invention is further described in detail with reference to the following specific examples, but the scope of the present invention is not limited to them.
Using a sliding window of m seconds, m seconds of video are captured each time at n frames per second. Human skeleton key points are identified in each of the m×n frames, and the top K skeletons are kept per frame; "top K" means that when several people appear in one picture, K skeletons are kept according to a rule such as the K with the highest confidence or the K with the largest area. The inter-frame skeleton data are then split into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person.
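The patent does not spell out the matching procedure beyond the use of Euclidean distance; the following sketch assumes a simple greedy nearest-neighbour assignment of each frame's skeletons to existing per-person tracks, with the distance threshold max_dist chosen arbitrarily:

import numpy as np

def split_into_tracks(frames, max_dist=100.0):
    """frames: list over time, each a list of (K, 2) key point arrays; returns per-person tracks."""
    tracks = []                                   # each track: list of (K, 2) arrays
    for skeletons in frames:
        unmatched = list(range(len(skeletons)))
        for track in tracks:
            if not unmatched:
                break
            last = track[-1]
            # distance = mean key-point-to-key-point Euclidean distance to the track's last frame
            dists = [np.linalg.norm(skeletons[i] - last, axis=1).mean() for i in unmatched]
            j = int(np.argmin(dists))
            if dists[j] < max_dist:
                track.append(skeletons[unmatched.pop(j)])
        for i in unmatched:                       # skeletons with no close track start new tracks
            tracks.append([skeletons[i]])
    return tracks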
The original bone data are image coordinates, which are not convenient for training and prediction of the deep learning network model, so the bone data are normalized. The specific normalization method is as follows.
The coordinate data are scaled to a height of 1080, with the width scaled proportionally.
All bone data are translated so that the bone center (the mid-point of the two hips) is the origin, making the data independent of image resolution, and multiplied by s0 = 1.0. The data distribution is shown in Fig. 1.
The displacement of each key point between the current frame and the previous frame is computed (0 for the first frame) and multiplied by s1 = 4.0; the data distribution is shown in Fig. 2. Here s0 adjusts the distribution range of the spatial part of the normalized feature data, and s1 adjusts the distribution range of the motion part.
The skeleton key points and the displacement data are concatenated to form the input data for training and prediction. The data are normalized to between -0.5 and 0.5, which lies within the maximum-gradient range of the tanh activation function and benefits the training convergence of the deep learning network model.
For example:
1. Skeleton key points are extracted from the original image: 67 key points in total, comprising 25 body key points, 21 left-hand key points and 21 right-hand key points. Each key point consists of (abscissa, ordinate), and the coordinate origin is the upper-left corner of the image.
Input image example (resolution 544×960):
Output example s1out: a 134-dimensional array in which every two values form one key point location.
[315, 368, 302, 428, 263, 428, 242, 502, 242, 562, 342, 428, 397, 399, 386, 349, 302, 557, 271, 560, 260, 660, 250, 743, 326, 557, 336, 659, 342, 746, 305, 360, 323, 363, 286, 368, 331, 371, 352, 788, 365, 783, 328, 757, 252, 773, 239, 767, 255, 746, 382, 348, 375, 348, 365, 343, 357, 337, 352, 333, 365, 325, 358, 315, 353, 310, 350, 305, 372, 322, 366, 310, 364, 301, 362, 293, 378, 321, 374, 308, 372, 301, 371, 293, 385, 321, 386, 313, 387, 308, 388, 305, 256, 569, 244, 568, 241, 583, 240, 592, 240, 599, 243, 578, 242, 589, 240, 595, 241, 601, 244, 577, 241, 586, 242, 592, 242, 600, 245, 579, 240, 584, 242, 590, 242, 601, 244, 580, 241, 597, 240, 597, 241, 601]
2. The ordinate is scaled to 1080 and the abscissa is scaled by the same factor.
In this example:
y_scale = 1080/960 = 1.125, and all 67×2 = 134 values from the first step are multiplied by y_scale to yield s2out:
[354.375, 414.0, 339.75, 481.5, 295.875, 481.5, 272.25, 564.75, 272.25, 632.25, 384.75, 481.5, 446.625, 448.875, 434.25, 392.625, 339.75, 626.625, 304.875, 630.0, 292.5, 742.5, 281.25, 835.875, 366.75, 626.625, 378.0, 741.375, 384.75, 839.25, 343.125, 405.0, 363.375, 408.375, 321.75, 414.0, 372.375, 417.375, 396.0, 886.5, 410.625, 880.875, 369.0, 851.625, 283.5, 869.625, 268.875, 862.875, 286.875, 839.25, 429.75, 391.5, 421.875, 391.5, 410.625, 385.875, 401.625, 379.125, 396.0, 374.625, 410.625, 365.625, 402.75, 354.375, 397.125, 348.75, 393.75, 343.125, 418.5, 362.25, 411.75, 348.75, 409.5, 338.625, 407.25, 329.625, 425.25, 361.125, 420.75, 346.5, 418.5, 338.625, 417.375, 329.625, 433.125, 361.125, 434.25, 352.125, 435.375, 346.5, 436.5, 343.125, 288.0, 640.125, 274.5, 639.0, 271.125, 655.875, 270.0, 666.0, 270.0, 673.875, 273.375, 650.25, 272.25, 662.625, 270.0, 669.375, 271.125, 676.125, 274.5, 649.125, 271.125, 659.25, 272.25, 666.0, 272.25, 675.0, 275.625, 651.375, 270.0, 657.0, 272.25, 663.75, 272.25, 676.125, 274.5, 652.5, 271.125, 671.625, 270.0, 671.625, 271.125, 676.125]
3. The coordinates are normalized relative to the center point of the human body.
Referring to the body key point layout, the 8th body key point is taken as the center point, i.e. the two values (s2out[16], s2out[17]) of the second-step output. The relative positions of the 67 key points are then calculated: s2out[16] is subtracted from all abscissas and s2out[17] from all ordinates.
Taking the first point (354.375, 414.0) as an example, after transformation:
(354.375, 414.0) - (s2out[16], s2out[17]) = (354.375, 414.0) - (339.75, 626.625) = (354.375 - 339.75, 414.0 - 626.625) = (14.625, -212.625). The relative coordinates are then divided by 1080 to obtain (0.01354, -0.19687) and multiplied by s0 = 1.0 to adjust the distribution range of the output values; the default value 1.0 is equivalent to no adjustment.
All 67 key points are transformed in this way to obtain s3out:
[0.01354, -0.19687, 0.0, -0.13437, -0.04063, -0.13437, -0.0625, -0.05729, -0.0625, 0.00521, 0.04167, -0.13437, 0.09896, -0.16458, 0.0875, -0.21667, 0.0, 0.0, -0.03229, 0.00313, -0.04375, 0.10729, -0.05417, 0.19375, 0.025, 0.0, 0.03542, 0.10625, 0.04167, 0.19687, 0.00313, -0.20521, 0.02187, -0.20208, -0.01667, -0.19687, 0.03021, -0.19375, 0.05208, 0.24063, 0.06563, 0.23542, 0.02708, 0.20833, -0.05208, 0.225, -0.06563, 0.21875, -0.04896, 0.19687, 0.08333, -0.21771, 0.07604, -0.21771, 0.06563, -0.22292, 0.05729, -0.22917, 0.05208, -0.23333, 0.06563, -0.24167, 0.05833, -0.25208, 0.05312, -0.25729, 0.05, -0.2625, 0.07292, -0.24479, 0.06667, -0.25729, 0.06458, -0.26667, 0.0625, -0.275, 0.07917, -0.24583, 0.075, -0.25938, 0.07292, -0.26667, 0.07187, -0.275, 0.08646, -0.24583, 0.0875, -0.25417, 0.08854, -0.25938, 0.08958, -0.2625, -0.04792, 0.0125, -0.06042, 0.01146, -0.06354, 0.02708, -0.06458, 0.03646, -0.06458, 0.04375, -0.06146, 0.02187, -0.0625, 0.03333, -0.06458, 0.03958, -0.06354, 0.04583, -0.06042, 0.02083, -0.06354, 0.03021, -0.0625, 0.03646, -0.0625, 0.04479, -0.05937, 0.02292, -0.06458, 0.02813, -0.0625, 0.03438, -0.0625, 0.04583, -0.06042, 0.02396, -0.06354, 0.04167, -0.06458, 0.04167, -0.06354, 0.04583]
A large amount of data was accumulated to obtain a distribution graph in which the abscissa is the normalized value and the ordinate is the count of each value, in units of millions (1e6). Observing Fig. 1, most values fall within the interval (-0.5, 0.5), i.e. the data are roughly normalized to (-0.5, 0.5). If the data need to be normalized to (-1, 1) instead, it suffices to set the parameter s0 = 2.0.
4. Steps 1-3 yield the first half of the final feature data, namely the spatial-position feature part. The motion feature information still has to be obtained, and the process is relatively simple.
Assume two adjacent frames f0 and f1 are transformed by steps 1 and 2 to obtain s2out0 and s2out1; the motion data are then (s2out1 - s2out0)/1080 × s1. The parameter s1 is introduced to adjust the value range of the motion feature part so that it is close to that of the spatial feature part, which benefits the subsequent training of the deep learning network model. With s1 = 4.0, the distribution of the motion feature data is as shown in Fig. 2.
A large amount of motion feature data was accumulated to obtain a distribution graph in which the abscissa is the normalized value and the ordinate is the count of each value, in units of millions (1e6).
5. The spatial features and the motion features are concatenated to form a 268-dimensional feature vector.
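As a quick numeric check of steps 2-3 for the first key point (315, 368) of the 544×960 example image (all values taken from s1out/s2out/s3out above):

y_scale = 1080 / 960                     # = 1.125
x, y = 315 * y_scale, 368 * y_scale      # = (354.375, 414.0)
cx, cy = 302 * y_scale, 557 * y_scale    # body key point 8 (the center) -> (339.75, 626.625)
s0 = 1.0
rel = ((x - cx) / 1080 * s0, (y - cy) / 1080 * s0)
print(rel)                               # ~ (0.01354, -0.19687), matching s3out[0:2]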
The bone sequence [x0, x1, x2, …] is input into the bidirectional recurrent deep learning network model, and the label of each frame is predicted. The output is, for example: [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and any non-o label marks an action frame. In this example, t is jump, z is turn, b_ marks the start of an action and i_ its continuation.
The bone data were normalized as described above.
A bidirectional recurrent deep learning network (Bi-LSTM) combined with a conditional random field (CRF) is adopted.
The training data are randomly composed over the whole sequence, and no-action sequences must be included between action sequences.
Production of the training data set (the training video data may contain only a single person):
Frames are extracted from the video to be labelled at 10 frames per second.
10 groups are extracted, ranked from high to low by image quality.
One group of data is labelled manually, i.e. each action sequence is put into the corresponding action directory. Frame numbers must not be contiguous between two action sequences.
The remaining groups are grouped automatically according to the manually labelled data.
Skeleton key points are extracted frame by frame.
The skeleton key point data are normalized in the manner described above.
The data are randomly combined into training sequences of 30-70 frames, containing both action sequences and non-action sequences.
The training data are split into a frame-number sequence and a label sequence stored in separate files. The feature data corresponding to the frame numbers are also stored in a separate feature file.
Deep learning network model: an end-to-end single-stream model is used in this embodiment; its principle is illustrated in Fig. 3.
Three-stream fusion model, as shown in Fig. 4:
Stream 1: bone data, i.e. the lengths between connected skeleton key points, for example wrist-elbow and elbow-shoulder; these data can be generated from the spatial features.
Stream 2: joint data, i.e. the spatial part of the normalized feature data described above.
Stream 3: motion data.
Each of the three streams is passed through its own linear transformation and tanh non-linear activation; the principle is shown in Fig. 5.
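A hedged PyTorch sketch of this three-stream front end is given below; the per-stream linear + Tanh follows Fig. 5, while the concatenation-based fusion, the hidden size and the bone_lengths helper are assumptions for illustration:

import torch
import torch.nn as nn

class ThreeStreamEncoder(nn.Module):
    def __init__(self, bone_dim, joint_dim, motion_dim, hidden=128):
        super().__init__()
        self.bone = nn.Sequential(nn.Linear(bone_dim, hidden), nn.Tanh())
        self.joint = nn.Sequential(nn.Linear(joint_dim, hidden), nn.Tanh())
        self.motion = nn.Sequential(nn.Linear(motion_dim, hidden), nn.Tanh())

    def forward(self, bone, joint, motion):
        # each input: (batch_size, seq_len, *_dim); the fused output feeds the Bi-LSTM + CRF
        return torch.cat([self.bone(bone), self.joint(joint), self.motion(motion)], dim=-1)

def bone_lengths(joints, pairs):
    """joints: (T, K, 2) centered key points; pairs: list of (i, j) connected key point indices."""
    return torch.stack([(joints[:, i] - joints[:, j]).norm(dim=-1) for i, j in pairs], dim=-1)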
Referring to Fig. 3, the single-stream model is described in detail.
The input data are normalized skeleton key point data, one item per frame; inputs of 1 to n frames are supported. The input shape is (batch_size, seq_len, flat_num).
A linear transformation and Tanh activation are applied.
The data are fed into a multi-layer bidirectional LSTM deep learning network.
A CRF layer strengthens the label transition constraints of the sequence.
An action sequence must start with B_.
B_ cannot be followed by O.
B_ is followed by I_, the continuation of the action.
Video stream detection and recognition:
A sliding window of m seconds with a step of s seconds is slid over the video in temporal order.
x frames of images are extracted from each window.
Skeleton key points are extracted for the people in each frame.
The skeleton key points are combined into several skeleton sequences according to Euclidean distance.
Each skeleton sequence is fed into the deep learning network model to obtain its prediction result.
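Putting the pieces together, the stream-detection loop might be sketched as follows; it reuses the hypothetical helpers sketched earlier (normalize_skeleton_sequence, split_into_tracks, decode_segments, SingleStreamTagger), while extract_keypoints stands in for an external pose estimator, ID2TAG is an assumed tag-id-to-label mapping, and m, s, n are example window parameters:

import numpy as np
import torch

ID2TAG = {0: "o", 1: "b_t", 2: "i_t", 3: "b_z", 4: "i_z"}   # illustrative mapping only

def detect_stream(video_frames, fps, model, m=4, s=1, n=10):
    """Slide an m-second window with step s seconds over the frames, keeping n frames/second."""
    results = []
    window, step = int(m * fps), int(s * fps)
    stride = max(1, int(fps / n))                       # subsample to n frames per second
    for start in range(0, len(video_frames) - window + 1, step):
        frames = video_frames[start:start + window:stride]
        keypoints = [extract_keypoints(f) for f in frames]      # top-K skeletons per frame
        for track in split_into_tracks(keypoints):              # one sequence per person
            feats = normalize_skeleton_sequence(np.stack(track), img_h=frames[0].shape[0])
            x = torch.tensor(feats[None], dtype=torch.float32)  # (1, seq_len, flat_num)
            labels = [ID2TAG[t] for t in model(x)[0]]           # CRF decode -> per-frame labels
            results.append((start, decode_segments(labels)))    # per-window action segments
    return results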
The foregoing is illustrative of the preferred embodiments of this invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A video stream action detection method based on human skeleton key points, characterized by comprising the following steps:
1) using a sliding window of m seconds, capturing m seconds of video at n frames per second each time to obtain m×n frames;
2) identifying human skeleton key points in each of the m×n frames and taking the top K skeletons in each frame;
3) splitting the inter-frame skeleton data into several skeleton sequences according to Euclidean distance, i.e. one skeleton sequence per person;
4) feeding each skeleton sequence into the deep learning network model to obtain its prediction result.
2. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein said step 3) further comprises a bone data normalization method comprising:
11) scaling the coordinate data to a height of 1080, with the width scaled proportionally;
12) translating all bone data so that the bone center is the origin, making the bone data independent of the image resolution, and multiplying the bone data by s0;
13) computing the displacement of each key point between the current frame and the previous frame (0 for the first frame) and multiplying it by s1, wherein s0 adjusts the distribution range of the spatial part of the normalized feature data and s1 adjusts the distribution range of the motion part;
14) concatenating the skeleton key points and the displacement data to form the input data for training and prediction, which yields the corresponding training data.
3. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the bone center is the mid-point of the two hips.
4. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the skeleton data are normalized to between -0.5 and 0.5, which lies within the maximum-gradient range of the tanh activation function and benefits the training convergence of the deep learning network model.
5. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the deep learning network model prediction method is: inputting the bone sequence [x0, x1, x2, …] into a bidirectional recurrent deep learning network model and predicting the label of each frame; the output is, for example: [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and any non-o label marks an action frame; in this example t is jump, z is turn, b_ marks the start of an action and i_ its continuation.
6. The video stream action detection method based on human skeleton key points as claimed in claim 1, further comprising a training data set production method comprising:
111) extracting frames from the video to be labelled at 10 frames per second;
112) extracting 10 groups, ranked from high to low by image quality;
113) manually labelling one group of data, i.e. putting each action sequence into the corresponding action directory; frame numbers must not be contiguous between two action sequences;
114) automatically grouping the remaining groups according to the manually labelled data;
115) extracting skeleton key points frame by frame;
116) normalizing the skeleton key point data as described above;
117) randomly combining the data into training sequences of 30-70 frames, each containing both action sequences and non-action sequences;
118) splitting the training data into a frame-number sequence and a label sequence stored in separate files;
the feature data corresponding to the frame numbers are also stored in a separate feature file.
7. The video stream action detection method based on human skeleton key points as claimed in claim 1, wherein the deep learning network model is a single-stream model described as follows:
the input is normalized skeleton key point data, one item per frame; inputs of 1 to n frames are supported; the input shape is (batch_size, seq_len, flat_num);
a linear transformation followed by Tanh activation;
the data are fed into a multi-layer bidirectional LSTM deep learning network;
a CRF layer strengthens the label transition constraints of the sequence;
B_ represents the start of an action, I_ the continuation of an action, and O no action;
O may be followed by O or B_, but not I_;
B_ may be followed by I_, but not B_ or O;
I_ may be followed by I_, O or B_.
CN202011431363.2A 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones Active CN112464856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011431363.2A CN112464856B (en) 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011431363.2A CN112464856B (en) 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones

Publications (2)

Publication Number Publication Date
CN112464856A true CN112464856A (en) 2021-03-09
CN112464856B CN112464856B (en) 2023-06-13

Family

ID=74801107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011431363.2A Active CN112464856B (en) 2020-12-09 2020-12-09 Video streaming detection method based on key points of human bones

Country Status (1)

Country Link
CN (1) CN112464856B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710802A (en) * 2018-12-20 2019-05-03 百度在线网络技术(北京)有限公司 Video classification methods and its device
CN110263666A (en) * 2019-05-29 2019-09-20 西安交通大学 A kind of motion detection method based on asymmetric multithread
CN110348321A (en) * 2019-06-18 2019-10-18 杭州电子科技大学 Human motion recognition method based on bone space-time characteristic and long memory network in short-term
US20190332866A1 (en) * 2018-04-26 2019-10-31 Fyusion, Inc. Method and apparatus for 3-d auto tagging
CN110991274A (en) * 2019-11-18 2020-04-10 杭州电子科技大学 Pedestrian tumbling detection method based on Gaussian mixture model and neural network
CN111383421A (en) * 2018-12-30 2020-07-07 奥瞳***科技有限公司 Privacy protection fall detection method and system
CN111680562A (en) * 2020-05-09 2020-09-18 北京中广上洋科技股份有限公司 Human body posture identification method and device based on skeleton key points, storage medium and terminal
CN111680613A (en) * 2020-06-03 2020-09-18 安徽大学 Method for detecting falling behavior of escalator passengers in real time

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190332866A1 (en) * 2018-04-26 2019-10-31 Fyusion, Inc. Method and apparatus for 3-d auto tagging
CN109710802A (en) * 2018-12-20 2019-05-03 百度在线网络技术(北京)有限公司 Video classification methods and its device
CN111383421A (en) * 2018-12-30 2020-07-07 奥瞳***科技有限公司 Privacy protection fall detection method and system
CN110263666A (en) * 2019-05-29 2019-09-20 西安交通大学 A kind of motion detection method based on asymmetric multithread
CN110348321A (en) * 2019-06-18 2019-10-18 杭州电子科技大学 Human motion recognition method based on bone space-time characteristic and long memory network in short-term
CN110991274A (en) * 2019-11-18 2020-04-10 杭州电子科技大学 Pedestrian tumbling detection method based on Gaussian mixture model and neural network
CN111680562A (en) * 2020-05-09 2020-09-18 北京中广上洋科技股份有限公司 Human body posture identification method and device based on skeleton key points, storage medium and terminal
CN111680613A (en) * 2020-06-03 2020-09-18 安徽大学 Method for detecting falling behavior of escalator passengers in real time

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ROMERO MORAIS et al.: "Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11988-11996 *
ZHEN QIN et al.: "Learning Local Part Motion Representation for Skeleton-based Action Recognition", Proceedings of the 2019 11th International Conference on Education Technology and Computers, pages 295-299 *
时俊: "Research and Implementation of a GCN-Based Human Behavior Recognition ***" (基于GCN人体行为识别***的研究与实现), China Masters' Theses Full-text Database, Information Science and Technology, no. 1, pages 138-1346 *
许政: "Human Skeleton Keypoint Detection Based on Deep Learning" (基于深度学习的人体骨架点检测), China Masters' Theses Full-text Database, Information Science and Technology, no. 1, pages 138-1603 *

Also Published As

Publication number Publication date
CN112464856B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
CN110287805B (en) Micro-expression identification method and system based on three-stream convolutional neural network
Cho et al. Self-attention network for skeleton-based human action recognition
CN108537743B (en) Face image enhancement method based on generation countermeasure network
WO2020155873A1 (en) Deep apparent features and adaptive aggregation network-based multi-face tracking method
Jiang et al. Action unit detection using sparse appearance descriptors in space-time video volumes
Levy et al. Live repetition counting
CN110084130B (en) Face screening method, device, equipment and storage medium based on multi-target tracking
CN112418095A (en) Facial expression recognition method and system combined with attention mechanism
Barnich et al. Frontal-view gait recognition by intra-and inter-frame rectangle size distribution
Tang et al. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition
CN111062314B (en) Image selection method and device, computer readable storage medium and electronic equipment
CN115880784A (en) Scenic spot multi-person action behavior monitoring method based on artificial intelligence
Lertniphonphan et al. Human action recognition using direction histograms of optical flow
CN112149557B (en) Person identity tracking method and system based on face recognition
TW200910221A (en) Method of determining motion-related features and method of performing motion classification
Flórez et al. Hand gesture recognition following the dynamics of a topology-preserving network
JP2014116716A (en) Tracking device
Cui et al. Pose-appearance relational modeling for video action recognition
CN112464856A (en) Video streaming detection method based on human skeleton key points
Zheng et al. A normalized light CNN for face recognition
Li et al. A novel motion based lip feature extraction for lip-reading
Torpey et al. Human action recognition using local two-stream convolution neural network features and support vector machines
Deotale et al. Optimized hybrid RNN model for human activity recognition in untrimmed video
He et al. MTRFN: Multiscale temporal receptive field network for compressed video action recognition at edge servers
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant