CN109583315B - Multichannel rapid human body posture recognition method for intelligent video monitoring

Info

Publication number
CN109583315B
Authority
CN
China
Prior art keywords
video
network
forwarding
thread
specific steps
Prior art date
Legal status
Active
Application number
CN201811299870.8A
Other languages
Chinese (zh)
Other versions
CN109583315A (en)
Inventor
赵霞
李磊
于重重
管文化
赵松
冯泽骁
Current Assignee
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN201811299870.8A priority Critical patent/CN109583315B/en
Publication of CN109583315A publication Critical patent/CN109583315A/en
Application granted granted Critical
Publication of CN109583315B publication Critical patent/CN109583315B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464 - Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention realizes a multichannel rapid human body posture recognition method oriented to intelligent video monitoring and builds the corresponding recognition system architecture. A forwarding server receives a client request, acquires the video stream from a network video recorder, and selects key frames for format conversion; it performs rapid moving-target detection and human body detection on the key frames and sends video frames containing a human body to an intelligent analysis server, which performs gesture recognition and returns the result to the forwarding server; the forwarding server forwards the recognition result and the video stream to the client, which displays them on its interface. While supporting multiple video channels simultaneously, the method achieves faster human body detection and gesture recognition, shields the influence of complex backgrounds and clothing, and markedly improves recognition speed and accuracy; it therefore has good application prospects and market value.

Description

Multichannel rapid human body posture recognition method for intelligent video monitoring
Technical Field
The invention relates to human body gesture recognition in surveillance video, in particular to a multichannel rapid human body gesture recognition method for intelligent video monitoring, and belongs to the fields of real-time streaming media and computer vision.
Background
At present, the main function of most video monitoring systems is to record and display the scene at a monitored site; they cannot intelligently analyze the site in real time, so security personnel must watch the live pictures, judge people's behavior and the nature of the scene themselves, and are subject to fatigue, omissions and misjudgments that vary from person to person. Existing video monitoring systems therefore contribute greatly to after-the-fact evidence collection, but in automatically identifying and discovering incidents on site they fall short of the timeliness, accuracy, intelligence and efficiency demanded by current security early-warning work.
Intelligent video monitoring is a new technology in the video monitoring field. Using computer vision, image processing and pattern recognition techniques, it automatically analyzes the video images captured by cameras, detects targets in the scene, recognizes and tracks them, analyzes and understands their behavior on that basis, and provides information useful for monitoring and early warning. Intelligent video monitoring systems are applied not only in security but also have broad application prospects and great economic value in traffic, military, finance, industry and other fields.
In many systems requiring intelligent video analysis and processing, such as action classification, abnormal behavior detection and automatic driving, describing human body gestures and predicting human behavior is important. Human body posture recognition can be applied to home monitoring and teaching, for example recognizing falls of elderly people living alone or assessing students' state in class. By analyzing the video streams of a monitoring system in real time, special human body gestures are identified and a timely response is made, such as rescuing an elderly person or adjusting teaching in a classroom.
Human body gesture recognition is the process of, given an image or video, detecting the positions of the human skeletal joints in it and classifying and labeling the human posture according to the structural characteristics of those joints. Human skeletal joint detection is the key step. With the development of deep learning, the detection of human skeletal joints has improved continuously, been applied across related fields of computer vision, and attracted sustained attention from researchers.
The patent "Pedestrian detection method based on video monitoring" (CN201010227766.5) uses extended gradient-histogram features and the Adaboost algorithm to detect pedestrians rapidly, then uses gradient-histogram features and a support vector machine to further verify the detected pedestrians. Patent CN20110566809.1 discloses a pedestrian detection method for intelligent video monitoring that trains a pedestrian detector with a support vector machine, classifies each pedestrian detection window in a picture with the SVM, and fuses the windows into the final detection result. The patent "Intelligent detection and early warning method for salient events in monitoring video with active visual attention" (CN201710181799.2) establishes a bottom-up rapid extraction method for primary visual-attention information and an active detection model for dynamic targets, tracks salient targets with a particle swarm algorithm, and builds an active early-warning model for salient events in the monitoring video, realizing an intelligent detection and early-warning system based on a visual attention model. The patent "A human body gesture recognition method based on OptiTrack" (CN201711120678.3) extracts gesture features of training samples with a locally linear embedding algorithm and, using a dimension-reduction approach, brings key semantic frames into the training samples' gesture features, classifying the key-frame features to recognize gestures. The patent "Learner gesture recognition method" (CN201710457825.X) separates the portrait from the background, extracts the learner's contour from the binarized image with mathematical morphology, extracts features with Zernike moments, trains the feature vectors with a support vector machine, and recognizes the learner's gesture.
These results focus mainly on pedestrian-detection and event-detection algorithms, extract features and detect with traditional machine learning, and do not address how to fuse such algorithms with the video forwarding flow of a video monitoring system. The present method acquires real-time image frames directly from the video monitoring equipment, detects the positions of human joint points with a deep-neural-network skeletal-joint detection algorithm, extracts structural human skeleton information and performs gesture recognition, avoiding the interference of complex picture backgrounds and clothing on the recognition result. The invention designs a pipelined mechanism for forwarding, detecting and recognizing multiple video streams, so that the server can forward each channel's video frames efficiently while rapidly detecting human targets and recognizing human postures.
Disclosure of Invention
The invention aims to realize a multichannel rapid human body posture recognition method oriented to intelligent video monitoring and to build the corresponding system architecture. The forwarding server receives a client request, acquires the video stream from the network video recorder, and selects key frames for format conversion; it performs rapid moving-target detection and human body detection on the key frames and sends video frames containing a human body to the intelligent analysis server, which performs gesture recognition and returns the result to the forwarding server; the forwarding server forwards the recognition result and the video stream to the client, and the client displays the result on its interface. Specifically, the method of the present invention comprises the steps of:
A. building a multichannel rapid human body posture recognition system architecture oriented to intelligent video monitoring;
A1. the system comprises a client, a video forwarding server (forwarding server for short), an intelligent analysis server and a network video recorder;
A2. the client sends a request for acquiring the video stream to the forwarding server, and displays the video image and the identification result to the user;
A3. the forwarding server receives a client request, acquires a requested video stream from the network video recorder and forwards the requested video stream to the client;
A4. the forwarding server performs rapid moving-target detection and rapid human body detection, sends a gesture recognition request to the intelligent analysis server, receives the recognition result and sends it to the client;
A5. the intelligent analysis server receives the identification request, identifies the gesture and returns an identification result to the forwarding server;
A6. the forwarding server and the intelligent analysis server exchange video data and recognition information through a network control port and a data port;
B. the forwarding server receives a client request, acquires the video stream, forwards it to the client, and creates a detection sub-thread for rapid detection (a sketch of the forwarding data packet and per-channel ring buffer follows these steps); the specific steps are as follows:
B1. the main thread receives a client request and acquires a video stream of a requested channel from the network video recorder;
B2. the main thread creates a ring buffer queue and a forwarding sub-thread for each video channel;
B2.1. creating ring buffer queues to store the forwarding data packets of the corresponding channels;
B2.2. creating a forwarding data packet for each acquired video frame and mounting it on the ring buffer queue of its channel;
the forwarding data packet comprises a data head and a video frame buffer; the data head includes but is not limited to the video frame size, format, time t, gesture recognition result information, and a key-frame flag nIDR, where nIDR = 1 denotes a key frame and nIDR = 0 a non-key frame;
B2.3. creating a forwarding sub-thread that takes video frames from the ring buffer queue and forwards them to the client for real-time display;
B3. creating a detection sub-thread, which acquires video frames from the ring buffer queue, performs rapid detection, and sends gesture recognition requests;
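As an illustration of steps B2.1-B2.3 and B3, a minimal Python sketch of one possible layout for the forwarding data packet and the per-channel ring buffer follows; the class and field names are hypothetical, not taken from the patent:

```python
import threading
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForwardPacket:
    # Data head of step B2.2 (field names are illustrative)
    size: int               # video frame size in bytes
    fmt: str                # e.g. "h264"
    t: float                # frame timestamp
    nIDR: int               # 1 = key frame, 0 = non-key frame
    result: Optional[dict]  # gesture recognition result, filled in at step G1
    frame: bytes            # video frame buffer

class ChannelRingBuffer:
    """Per-channel ring buffer shared by the forwarding and detection sub-threads."""
    def __init__(self, capacity: int = 64):
        self._q = deque(maxlen=capacity)  # oldest packets are dropped automatically
        self._lock = threading.Lock()

    def put(self, pkt: ForwardPacket) -> None:
        with self._lock:
            self._q.append(pkt)

    def latest(self) -> Optional[ForwardPacket]:
        with self._lock:
            return self._q[-1] if self._q else None
```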
C. the detection sub-thread corresponding to the channel acquires the key frame and performs format conversion, and the specific steps are as follows:
C1. the detection sub-thread acquires a forwarding data packet from a ring buffer queue of the channel to which the detection sub-thread belongs;
C2. selecting video frames with nIDR = 1 as key frames for subsequent processing;
C3. decoding the H.264 format video frame into YUV format, and converting into JPG format;
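As an illustration of steps C2-C3, the sketch below selects key frames and converts them to JPG using PyAV; the library choice is an assumption (the patent names none), and the RTSP URL is a placeholder:

```python
import av  # PyAV, a Python wrapper around FFmpeg's decoders

container = av.open("rtsp://nvr.example/ch1")  # placeholder NVR channel URL
for i, frame in enumerate(container.decode(video=0)):
    if not frame.key_frame:   # step C2: keep only frames flagged nIDR = 1
        continue
    # step C3: the decoder yields YUV pixel data; convert and store as JPG
    frame.to_image().save(f"ch1_key_{i:06d}.jpg")
```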
D. the detection sub-thread carries out the rapid detection of the moving target on the key frame, and the specific steps are as follows:
D1. compute the gray-level differences of 3 consecutive video frames; the specific steps are as follows:
D1.1. let the grayscale images of the video frames at times t_{n-1}, t_n and t_{n+1} be denoted F_{n-1}, F_n and F_{n+1};
D1.2. compute the differences between frames F_n and F_{n-1}, and between F_{n+1} and F_n, taken respectively as the foreground images D_{n-1} and D_n at times t_{n-1} and t_n;
D2. perform rapid moving-target detection on the foreground images, recording an image that contains a moving target as R_n'; the specific steps are as follows:
D2.1. intersect D_{n-1} and D_n to obtain D_n';
D2.2. binarize each pixel of D_n' against a preset threshold T1 to obtain the binarized image R_n, where the value of T1 includes but is not limited to 10; the specific steps are as follows:
D2.2.1. let d be the gray value of a pixel in D_n';
D2.2.2. if d > T1, set R_n = 255 at that pixel, i.e. a moving-target point;
D2.2.3. if d <= T1, set R_n = 0 at that pixel, i.e. a background point;
D2.3. count the number of pixels in R_n with value 255; if the count exceeds a preset threshold T2, a moving target is considered present in the video and the image is recorded as R_n'; for an image resolution of 464 x 464, T2 takes the value 30000;
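Steps D1-D2 describe the classic three-frame-difference detector; a minimal OpenCV sketch under that reading (the function name is hypothetical) follows:

```python
import cv2
import numpy as np

T1, T2 = 10, 30000  # thresholds of steps D2.2 and D2.3 (for 464 x 464 frames)

def moving_target(f_prev, f_cur, f_next):
    """f_prev, f_cur, f_next: grayscale frames F_{n-1}, F_n, F_{n+1}."""
    d_prev = cv2.absdiff(f_cur, f_prev)       # D1.2: foreground image D_{n-1}
    d_cur = cv2.absdiff(f_next, f_cur)        # D1.2: foreground image D_n
    d_inter = cv2.bitwise_and(d_prev, d_cur)  # D2.1: intersection D_n'
    _, r_n = cv2.threshold(d_inter, T1, 255, cv2.THRESH_BINARY)  # D2.2
    has_target = int(np.count_nonzero(r_n)) > T2                 # D2.3
    return r_n, has_target
```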
E. the detection sub-thread detects human body of Rn' and sends a gesture recognition request to the intelligent analysis server, and the specific steps are as follows:
E1. loading parameters of a deep neural network model (network for short) including anchor frame information;
the anchor frame information refers to the widths and heights of the anchor frames obtained by clustering when the network is trained; the anchor frames are the N bounding boxes with the highest probability of occurrence, used to predict targets, where N includes but is not limited to 5;
E2. rescaling the input image to a resolution of 464 x 464 and passing it through the network's convolution and pooling layers to obtain a feature map with a resolution of 13 x 13;
E3. the target frame is predicted for the feature map through a network prediction layer, and the specific steps are as follows:
E3.1. for each pixel on the feature map, predicting the information of M target frames, the target-frame classes, and the corresponding confidences and class probabilities using the anchor frames;
the information of the target frame comprises: the offset of the center of the target frame relative to the pixel, and the width and height of the target frame, are denoted (x, y, w, h);
the confidence level represents the accuracy of the predicted target frame position information;
the class probability represents the probability of predicting the class of the target frame as a human body;
E3.2. filtering out target frames with confidence below a preset threshold T, wherein the threshold T comprises but is not limited to 0.7;
E3.3. applying non-maximum suppression to the remaining target frames to remove duplicate frames;
E3.4. selecting a target frame with highest category probability, and outputting coordinates of a lower left corner and an upper right corner;
E4. if the image frames detected in step E3 contain a human body, the original image frames containing a human body and the target frame information are packaged into a gesture recognition request every K frames and sent to the intelligent analysis server, where K >= 1;
E5. if no human body is detected in step E3, no subsequent processing is performed;
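A NumPy sketch of the box selection of steps E3.2-E3.4 follows; the IoU threshold of 0.5 used for the suppression is an assumption, since the patent does not state one:

```python
import numpy as np

def box_area(b):
    return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

def select_human_box(boxes, conf, cls_prob, t_conf=0.7, t_iou=0.5):
    """boxes: (n, 4) corner coordinates (x1, y1, x2, y2); conf, cls_prob: (n,)."""
    keep = conf >= t_conf  # E3.2: drop frames with confidence below T
    boxes, conf, cls_prob = boxes[keep], conf[keep], cls_prob[keep]
    order, kept = conf.argsort()[::-1], []
    while order.size:      # E3.3: non-maximum suppression over the rest
        i, rest = order[0], order[1:]
        kept.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (box_area(boxes[i:i + 1]) + box_area(boxes[rest]) - inter)
        order = rest[iou <= t_iou]
    if not kept:
        return None
    best = max(kept, key=lambda j: cls_prob[j])  # E3.4: highest class probability
    return boxes[best]                           # corner coordinates of the box
```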
F. the intelligent analysis server receives the gesture recognition request, performs gesture recognition processing, and returns a result to the forwarding server, and the specific steps are as follows:
F1. the main thread creates an identification sub-thread and an identification buffer queue for each video channel, and the specific steps are as follows:
F1.1. the main thread receives a gesture recognition request sent from a forwarding server, and mounts the received image frames to a recognition buffer queue corresponding to the channel;
F1.2. the identification sub-thread corresponding to the channel acquires the image frames in the identification buffer queue and marks the image frames as an original picture S;
F2. the recognition sub-thread loads a deep neural network model composed of four cascaded network stages and extracts the feature information of the human joints in the original picture S; the specific steps are as follows:
F2.1. the first-stage network comprises two network paths N_11 and N_12 and generates the feature maps F^1_1 ~ F^1_14; the specific steps are as follows:
F2.1.1. the N_11 network extracts features from the original picture S using several residual modules and outputs the 14 feature maps F_{11-1} ~ F_{11-14}; the N_12 network first downsamples the original picture S, passes it through several residual modules, then upsamples, outputting the 14 feature maps F_{12-1} ~ F_{12-14}; each feature map corresponds to one joint point, located at its highest Gaussian response;
the 14 joint points corresponding to the 14 feature maps are: head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle;
F2.1.2. on the basis of the feature maps from F2.1.1, an encoder and a decoder are introduced to generate the weighted feature maps F'_{11-1} ~ F'_{11-14} and F'_{12-1} ~ F'_{12-14} with weights W (the encoder is sketched after step F2.1.4); the specific steps are as follows:
F2.1.2.1. the encoder uniformly divides each feature map of F2.1.1 into several regions, called feature regions a; the code corresponding to each region is y;
the set of feature regions a is denoted R_D = {a_1, …, a_i, …, a_L}, where a_i is the i-th region, 1 <= i <= L, and L is the number of regions; a_i ∈ R_D; R_D is the complete feature map with D as the cutting unit, where D is the number of pixels per region; for a region size of 14 x 14, D = 196;
the code y is a 14-dimensional vector, 14 being the number of joints; the code of the i-th region is denoted y_i, and the j-th element of y_i is denoted y_ij: y_ij = 1 indicates that the i-th region contains the j-th joint point, y_ij = 0 that it does not; the set of codes is denoted Y_D = {y_1, …, y_i, …, y_L}, 1 <= i <= L;
F2.1.2.2. the decoder calculates the weights W_{11-1} ~ W_{11-14} and W_{12-1} ~ W_{12-14} for the feature maps of F2.1.1: each W is the set of the weights w of all feature regions in the feature map, denoted W = {w_1, …, w_i, …, w_L}; the weight w_i of each feature region represents the scale coefficient that region carries when it is input to the next stage of processing, and is calculated from the feature region a and the code y;
F2.1.3. fuse the feature maps F'_{11-1} ~ F'_{11-14} and F'_{12-1} ~ F'_{12-14} to obtain the output feature maps F^1_1 ~ F^1_14 of this stage;
F2.1.4. the original picture S and the feature maps F^1_1 ~ F^1_14 serve as the input to the next-stage network;
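A minimal sketch of the encoder of steps F2.1.2.1-F2.1.2.2: it produces the region set R_D and the indicator codes Y_D; the patent does not give the decoder's exact weight formula, so the sketch stops where the weights w_i would be computed from a and y:

```python
import numpy as np

def encode_feature_map(fmap, joints, region=14):
    """fmap: (H, W) feature map; joints: list of 14 (x, y) joint coordinates.
    Returns R_D (feature regions a_i) and Y_D (14-dim indicator codes y_i)."""
    H, W = fmap.shape
    R_D, Y_D = [], []
    for r0 in range(0, H - region + 1, region):      # uniform division into
        for c0 in range(0, W - region + 1, region):  # regions of D = region*region pixels
            R_D.append(fmap[r0:r0 + region, c0:c0 + region])
            y = np.zeros(14, dtype=int)
            for j, (jx, jy) in enumerate(joints):
                if c0 <= jx < c0 + region and r0 <= jy < r0 + region:
                    y[j] = 1  # y_ij = 1: region i contains joint point j
            Y_D.append(y)
    return R_D, Y_D  # the decoder derives the weights w_i from these
```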
F2.2. the second-stage sub-network comprises two network paths N_21 and N_22; taking the feature maps F^1_1 ~ F^1_14 output by the first stage and the original picture S as input, the specific steps under F2.1 are repeated to obtain the outputs F^2_1 ~ F^2_14;
F2.3. the third-stage sub-network comprises two network paths N_31 and N_32; taking the feature maps F^2_1 ~ F^2_14 output by the second stage and the original picture S as input, the specific steps under F2.1 are repeated to obtain the outputs F^3_1 ~ F^3_14;
F2.4. the fourth-stage sub-network comprises two network paths N_41 and N_42; taking the feature maps F^3_1 ~ F^3_14 output by the third stage and the original picture S as input, the specific steps under F2.1 are repeated to obtain the outputs F^4_1 ~ F^4_14;
F2.5. using the feature maps F^4_1 ~ F^4_14, extract the name and coordinates of each joint point, connect adjacent joints according to physiological common sense, calculate the distances between joint points, and construct the skeleton diagram; the skeleton diagram comprises the names and coordinates of all joint points and the distances between them;
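Step F2.5 can be sketched as taking each joint at the argmax of its Gaussian-response map and measuring the connected segments; the adjacency list below is one plausible reading of "physiological common sense" and is not fixed by the patent:

```python
import math
import numpy as np

JOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist", "l_shoulder",
          "l_elbow", "l_wrist", "r_hip", "r_knee", "r_ankle", "l_hip",
          "l_knee", "l_ankle"]  # order of the maps F^4_1 ~ F^4_14

BONES = [("head", "neck"), ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
         ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
         ("r_hip", "r_knee"), ("r_knee", "r_ankle"),
         ("l_hip", "l_knee"), ("l_knee", "l_ankle")]

def build_skeleton(heatmaps):
    """heatmaps: (14, H, W) array, one response map per joint point."""
    coords = {}
    for name, hm in zip(JOINTS, heatmaps):
        y, x = np.unravel_index(int(hm.argmax()), hm.shape)  # highest response
        coords[name] = (int(x), int(y))
    lengths = {b: math.dist(coords[b[0]], coords[b[1]]) for b in BONES}
    return coords, lengths  # joint names, coordinates and joint-to-joint distances
```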
F3. the recognition sub-thread classifies the gesture of the skeleton map by using an SVM classifier, and the specific steps are as follows:
F3.1. loading the trained SVM classifier; the recognizable poses include but are not limited to kneeling, lying, sitting and standing;
F3.2. classifying the skeleton diagram, and returning the identification result to the forwarding server; the recognition result comprises a skeleton diagram and the gesture category and accuracy of the skeleton diagram;
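Step F3 can be sketched with scikit-learn (an assumed library choice); the feature encoding below, a vector of the nine segment lengths, is likewise an assumption, since the patent does not fix the classifier's exact input:

```python
import numpy as np
from sklearn.svm import SVC

POSES = ["kneeling", "lying", "sitting", "standing"]

# Illustrative stand-in training data: (n, 9) segment-length vectors with labels.
rng = np.random.default_rng(0)
X_train = rng.uniform(10, 100, size=(200, 9))
y_train = rng.integers(0, len(POSES), size=200)

clf = SVC(kernel="rbf", probability=True)  # probability=True enables per-class scores
clf.fit(X_train, y_train)

def classify_skeleton(segment_lengths):
    """segment_lengths: the 9 distances from the skeleton diagram (step F2.5)."""
    probs = clf.predict_proba([segment_lengths])[0]
    best = int(np.argmax(probs))
    return POSES[best], float(probs[best])  # gesture category and its accuracy
```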
G. the main thread of the forwarding server receives the identification result and forwards the identification result to the client, and the client displays the result on an interface; the method comprises the following specific steps:
G1. the main thread receives the recognition result and writes it into the latest forwarding data packet in the ring buffer queue of the corresponding channel;
G2. the forwarding sub-thread of the corresponding channel acquires the forwarding data packet from the ring buffer queue and sends it to the client;
G3. the client receives the forwarding data packet containing the identification result, extracts the video frame and the identification information, and displays the video frame and the identification information on the client interface.
The invention provides a rapid, real-time human body detection and gesture recognition method based on deep neural networks, comprising three stages: rapid moving-target detection, rapid human body detection, and human body gesture recognition. A multi-channel concurrency mechanism and a distributed processing mechanism let the three tasks run in parallel, so that a multi-core processor system can better exploit its parallel computing capacity, and human body detection and gesture recognition run faster while multiple video channels are supported simultaneously. Using a deep neural network for gesture recognition, extracting the human joint information and constructing the skeleton information shields the influence of noise such as complex backgrounds and clothing, markedly improves recognition speed and accuracy, and reduces hardware cost, giving the method good application prospects and market value.
Drawings
Fig. 1: a flow chart of a multichannel rapid human body gesture recognition method for intelligent video monitoring;
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
The invention is further described below, following the implementation steps, with yoga teaching video gesture recognition as an example. The experimental environment is as follows:
[Tables in Figures GSB0000178620060000061 and GSB0000178620060000071 of the original document list the experimental environment configuration.]
The method flow is shown in Fig. 1 and comprises the following steps: 1) build the multichannel rapid human body gesture recognition system oriented to intelligent video monitoring; 2) the detection sub-thread performs rapid moving-target detection; 3) the detection sub-thread performs rapid human body detection; 4) rapid human body posture recognition: the intelligent analysis server performs rapid gesture recognition on the human-body video frames with a neural network and sends the recognition result to the forwarding server; 5) the client presents the gesture recognition result: the forwarding server forwards the recognition result and the corresponding video stream to the client, which receives the forwarding data packet containing the result and displays the gesture recognition result on its interface. The invention is further described step by step with reference to the system example as follows:
1. building a multichannel rapid human body posture recognition system architecture oriented to intelligent video monitoring;
1.1. the system comprises a client, a video forwarding server (forwarding server for short), an intelligent analysis server and a network video recorder;
1.2. the client sends a request to the forwarding server to acquire the yoga teaching video, and displays the video images and recognition results to the user;
1.3. the forwarding server receives a client request, acquires yoga teaching video stream from the network video recorder and forwards the yoga teaching video stream to the client;
1.4. the forwarding server performs rapid moving-target detection and rapid human body detection, sends a gesture recognition request to the intelligent analysis server, receives the recognition result and sends it to the client;
1.5. the intelligent analysis server receives the identification request, identifies the gesture and returns an identification result to the forwarding server;
1.6. the forwarding server and the intelligent analysis server exchange video data and recognition information through a network control port and a data port;
2. the forwarding server acquires the video stream and its detection sub-thread performs rapid moving-target detection on key frames; the specific steps are as follows:
2.1. the main thread receives the client request and acquires the video stream of the requested channel from the network video recorder;
2.2. the main thread creates a ring buffer queue and a forwarding sub-thread for each video channel;
2.3. the detection sub-thread corresponding to the channel acquires the video frame and performs format conversion, and the specific steps are as follows:
2.3.1. the detection sub-thread acquires a forwarding data packet from the channel ring buffer queue to which the detection sub-thread belongs;
2.3.2. selecting video frames with nIDR = 1 as key frames for subsequent processing;
2.3.3. decoding the H.264 format video frame into YUV format, and converting into JPG format video frame;
2.4. the detection sub-thread carries out the rapid detection of the moving target on the key frame, and the specific steps are as follows:
2.4.1. calculate the gray-level differences of 3 consecutive video frames to obtain the foreground images D_{n-1} and D_n at times t_{n-1} and t_n;
2.4.2. perform intersection and binarization on the foreground images to obtain the image R_n;
2.4.3. count the number of pixels in R_n with value 255; if the count exceeds the set threshold 30000, a moving target is considered present in the video, and the image is recorded as R_n';
3. the detection sub-thread detects human body of Rn' and sends a gesture recognition request to the intelligent analysis server, and the specific steps are as follows:
3.1. loading parameters of a deep neural network model (network for short) including anchor frame information;
the anchor frame information refers to the widths and heights of the anchor frames obtained by clustering when the network was trained, namely (10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119), (116, 90), (156, 198) and (373, 326);
3.2. rescale the input image to a resolution of 464 x 464 and pass it through the network's convolution and pooling layers to obtain a feature map with a resolution of 13 x 13;
3.3. the target frame is predicted for the feature map through a network prediction layer, and the specific steps are as follows:
3.3.1. for each pixel on the feature map, predicting information of 5 target frames, target frame categories and corresponding confidence and category probabilities by using anchor frames;
finally, the (x, y, w, h) information of the 5 target frames with the highest class probabilities is (249.5, 346, 16, 449), (249.5, 462, 15, 449), (249.5, 461.5, 15, 449), (249.5, 232, 82, 449) and (249.5, 404.5, 23, 449);
3.3.2. filtering a target frame with the confidence coefficient lower than a preset threshold value T, wherein the threshold value T is 0.7;
3.3.3. applying non-maximum suppression to the remaining target frames to remove duplicate frames;
3.3.4. selecting a target frame with highest category probability, and outputting coordinates of a lower left corner and an upper right corner;
finally, the target frame with the highest class probability is obtained, with probability 0.97645 and lower-left and upper-right corner coordinates (0, 338) and (499, 354) respectively;
3.4. the image frames detected in step 3.3 contain a human body; the original image frames containing a human body and the target frame information are packaged into a gesture recognition request every 5 frames and sent to the intelligent analysis server;
4. the intelligent analysis server receives the gesture recognition request, performs gesture recognition processing, and returns a result to the forwarding server, and the specific steps are as follows:
4.1. the main thread creates an identification sub-thread and an identification buffer queue for each video channel, and the specific steps are as follows:
4.1.1. the main thread receives a gesture recognition request sent from a forwarding server, and mounts the received image frames to a recognition buffer queue corresponding to the channel;
4.1.2. the identification sub-thread corresponding to the channel acquires the image frames in the identification buffer queue and marks the image frames as an original picture S;
4.2. the recognition sub-thread loads a deep neural network model composed of four cascaded network stages and extracts the feature information of the human joints in the original picture S; the specific steps are as follows:
4.2.1. the first-stage network comprises two network paths N_11 and N_12 and generates the feature maps F^1_1 ~ F^1_14; the specific steps are as follows:
4.2.1.1. the N_11 network extracts features from the original picture S using several residual modules and outputs the 14 feature maps F_{11-1} ~ F_{11-14}; the N_12 network first downsamples the original picture S, passes it through several residual modules, then upsamples, outputting the 14 feature maps F_{12-1} ~ F_{12-14}; each feature map corresponds to one joint point, located at its highest Gaussian response;
the 14 joint points corresponding to the 14 feature maps are: head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle;
4.2.2. on the basis of the feature maps from 4.2.1.1, an encoder and a decoder are introduced to generate the weighted feature maps F'_{11-1} ~ F'_{11-14} and F'_{12-1} ~ F'_{12-14} with weights W; the specific steps are as follows:
4.2.2.1. the encoder uniformly divides each feature map of 4.2.1.1 into several regions, called feature regions a; the code corresponding to each region is y;
the set of feature regions a is denoted R_D = {a_1, …, a_i, …, a_L}, where a_i is the i-th region, 1 <= i <= L, and L is the number of regions; a_i ∈ R_D; R_D is the complete feature map with D as the cutting unit, where D is the number of pixels per region; for a region size of 14 x 14, D = 196;
the code y is a 14-dimensional vector, 14 being the number of joints; the code of the i-th region is denoted y_i, and the j-th element of y_i is denoted y_ij: y_ij = 1 indicates that the i-th region contains the j-th joint point, y_ij = 0 that it does not; the set of codes is denoted Y_D = {y_1, …, y_i, …, y_L}, 1 <= i <= L;
4.2.2.2. the decoder calculates the weights W_{11-1} ~ W_{11-14} and W_{12-1} ~ W_{12-14} for the feature maps of 4.2.1.1: each W is the set of the weights w of all feature regions in the feature map, denoted W = {w_1, …, w_i, …, w_L}; the weight w_i of each feature region represents the scale coefficient that region carries when it is input to the next stage of processing, and is calculated from the feature region a and the code y;
4.2.3. fuse the feature maps F'_{11-1} ~ F'_{11-14} and F'_{12-1} ~ F'_{12-14} to obtain the output feature maps F^1_1 ~ F^1_14 of this stage;
4.2.4. the original picture S and the feature maps F^1_1 ~ F^1_14 serve as the input to the next-stage network;
4.2.5. the second-stage sub-network comprises two network paths N_21 and N_22; taking the feature maps F^1_1 ~ F^1_14 output by the first stage and the original picture S as input, the specific steps under step 4.2.1 are repeated to obtain the outputs F^2_1 ~ F^2_14;
4.2.6. the third-stage sub-network comprises two network paths N_31 and N_32; taking the feature maps F^2_1 ~ F^2_14 output by the second stage and the original picture S as input, the specific steps under step 4.2.1 are repeated to obtain the outputs F^3_1 ~ F^3_14;
4.2.7. the fourth-stage sub-network comprises two network paths N_41 and N_42; taking the feature maps F^3_1 ~ F^3_14 output by the third stage and the original picture S as input, the specific steps under step 4.2.1 are repeated to obtain the outputs F^4_1 ~ F^4_14;
4.2.8. using the feature maps F^4_1 ~ F^4_14, extract the names and coordinates of each joint point, connect adjacent joints according to physiological common sense, calculate the distances between joint points, and construct the skeleton diagram, as follows:
taking the upper-left corner of the picture as the coordinate origin, the coordinates of the head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee and left ankle are (62, 123), (98, 120), (108, 95), (138, 67), (162, 85), (107, 144), (115, 169), (82, 161), (166, 103), (244, 105), (299, 113), (166, 131), (248, 127) and (300, 131) respectively; the calculated segment lengths for (head, neck), (right shoulder, right elbow), (right elbow, right wrist), (left shoulder, left elbow), (left elbow, left wrist), (right hip, right knee), (right knee, right ankle), (left hip, left knee) and (left knee, left ankle) are 36.12, 41.04, 30.00, 26.25, 33.96, 78.03, 55.58, 82.10 and 52.15 respectively;
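The listed segment lengths follow directly from the listed coordinates; a short check:

```python
import math

# Joint coordinates from step 4.2.8, origin at the picture's upper-left corner
pts = {"head": (62, 123), "neck": (98, 120),
       "r_shoulder": (108, 95), "r_elbow": (138, 67), "r_wrist": (162, 85),
       "l_shoulder": (107, 144), "l_elbow": (115, 169), "l_wrist": (82, 161),
       "r_hip": (166, 103), "r_knee": (244, 105), "r_ankle": (299, 113),
       "l_hip": (166, 131), "l_knee": (248, 127), "l_ankle": (300, 131)}

bones = [("head", "neck"), ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
         ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
         ("r_hip", "r_knee"), ("r_knee", "r_ankle"),
         ("l_hip", "l_knee"), ("l_knee", "l_ankle")]

for a, b in bones:
    print(a, "-", b, round(math.dist(pts[a], pts[b]), 2))
# prints 36.12, 41.04, 30.0, 26.25, 33.96, 78.03, 55.58, 82.1, 52.15
```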
4.3. the recognition sub-thread classifies the gesture of the skeleton map by using an SVM classifier, and the specific steps are as follows:
4.3.1. loading the trained SVM classifier and classifying the skeleton diagram; the recognition results are: sitting: 92%, lying: 85%, standing: 95%, kneeling: 80.3%; the gesture category with the highest accuracy, standing, is selected;
4.3.2. the skeleton information, the gesture category "standing" and the accuracy 95% are returned to the forwarding server as the recognition result;
5. the main thread of the forwarding server receives the identification result and forwards the identification result to the client, and the client displays the result on an interface; the method comprises the following specific steps:
5.1. the main thread receives the recognition result and writes it into the latest forwarding data packet in the ring buffer queue of the corresponding channel;
5.2. the forwarding sub-thread of the corresponding channel acquires the forwarding data packet from the ring buffer queue and sends it to the client;
5.3. the client receives the forwarding data packet containing the identification result, extracts the video frame and the identification information, and displays the video frame and the identification information on the client interface.
Finally, it should be noted that the examples are disclosed to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; the scope of protection of the invention is defined by the appended claims.

Claims (1)

1. A multichannel rapid human body posture identification method for intelligent video monitoring comprises the following steps:
A. building a multichannel rapid human body posture recognition system architecture oriented to intelligent video monitoring;
A1. the system comprises a client, a video forwarding server (forwarding server for short), an intelligent analysis server and a network video recorder;
A2. the client sends a request for acquiring the video stream to the forwarding server, and displays the video image and the identification result to the user;
A3. the forwarding server receives a client request, acquires a requested video stream from the network video recorder and forwards the requested video stream to the client;
A4. the forwarding server performs rapid moving-target detection and rapid human body detection, sends a gesture recognition request to the intelligent analysis server, receives the recognition result and sends it to the client;
A5. the intelligent analysis server receives the identification request, identifies the gesture and returns an identification result to the forwarding server;
A6. the forwarding server and the intelligent analysis server exchange video data and recognition information through a network control port and a data port;
B. the forwarding server receives a client request, acquires the video stream, forwards it to the client, and creates a detection sub-thread for rapid detection; the specific steps are as follows:
B1. the main thread receives a client request and acquires a video stream of a requested channel from the network video recorder;
B2. the main thread creates a ring buffer queue and a forwarding sub-thread for each video channel;
B2.1. creating ring buffer queues to store the forwarding data packets of the corresponding channels;
B2.2. creating a forwarding data packet for each acquired video frame and mounting it on the ring buffer queue of its channel;
the forwarding data packet comprises a data head and a video frame buffer; the data head includes but is not limited to the video frame size, format, time t, gesture recognition result information, and a key-frame flag nIDR, where nIDR = 1 denotes a key frame and nIDR = 0 a non-key frame;
B2.3. creating a forwarding sub-thread that takes video frames from the ring buffer queue and forwards them to the client for real-time display;
B3. creating a detection sub-thread, which acquires video frames from the ring buffer queue, performs rapid detection, and sends gesture recognition requests;
C. the detection sub-thread corresponding to the channel acquires the key frame and performs format conversion, and the specific steps are as follows:
C1. the detection sub-thread acquires a forwarding data packet from a ring buffer queue of the channel to which the detection sub-thread belongs;
C2. selecting video frames with nIDR = 1 as key frames for subsequent processing;
C3. decoding the H.264 format video frame into YUV format, and converting into JPG format;
D. the detection sub-thread carries out the rapid detection of the moving target on the key frame, and the specific steps are as follows:
D1. compute the gray-level differences of 3 consecutive video frames; the specific steps are as follows:
D1.1. let the grayscale images of the video frames at times t_{n-1}, t_n and t_{n+1} be denoted F_{n-1}, F_n and F_{n+1};
D1.2. compute the differences between frames F_n and F_{n-1}, and between F_{n+1} and F_n, taken respectively as the foreground images D_{n-1} and D_n at times t_{n-1} and t_n;
D2. perform rapid moving-target detection on the foreground images, recording an image that contains a moving target as R_n'; the specific steps are as follows:
D2.1. intersect D_{n-1} and D_n to obtain D_n';
D2.2. binarize each pixel of D_n' against a preset threshold T1 to obtain the binarized image R_n, where the value of T1 includes but is not limited to 10; the specific steps are as follows:
D2.2.1. let d be the gray value of a pixel in D_n';
D2.2.2. if d > T1, set R_n = 255 at that pixel, i.e. a moving-target point;
D2.2.3. if d <= T1, set R_n = 0 at that pixel, i.e. a background point;
D2.3. count the number of pixels in R_n with value 255; if the count exceeds a preset threshold T2, a moving target is considered present in the video and the image is recorded as R_n'; for an image resolution of 464 x 464, T2 takes the value 30000;
E. the detection sub-thread detects human body of Rn' and sends a gesture recognition request to the intelligent analysis server, and the specific steps are as follows:
E1. loading parameters of a deep neural network model (network for short) including anchor frame information;
the anchor frame information refers to the widths and heights of the anchor frames obtained by clustering when the network is trained; the anchor frames are the N bounding boxes with the highest probability of occurrence, used to predict targets, where N includes but is not limited to 5;
E2. rescaling the input image to a resolution of 464 x 464 and passing it through the network's convolution and pooling layers to obtain a feature map with a resolution of 13 x 13;
E3. the target frame is predicted for the feature map through a network prediction layer, and the specific steps are as follows:
E3.1. for each pixel on the feature map, predicting the information of M target frames, the target-frame classes, and the corresponding confidences and class probabilities using the anchor frames;
the information of the target frame comprises: the offset of the center of the target frame relative to the pixel, and the width and height of the target frame, are denoted (x, y, w, h);
the confidence level represents the accuracy of the predicted target frame position information;
the class probability represents the probability of predicting the class of the target frame as a human body;
E3.2. filtering out target frames with confidence below a preset threshold T, wherein the threshold T comprises but is not limited to 0.7;
E3.3. applying non-maximum suppression to the remaining target frames to remove duplicate frames;
E3.4. selecting a target frame with highest category probability, and outputting coordinates of a lower left corner and an upper right corner;
E4. if the image frames detected in step E3 contain a human body, the original image frames containing a human body and the target frame information are packaged into a gesture recognition request every K frames and sent to the intelligent analysis server, where K >= 1;
E5. if no human body is detected in step E3, no subsequent processing is performed;
F. the intelligent analysis server receives the gesture recognition request, performs gesture recognition processing, and returns a result to the forwarding server, and the specific steps are as follows:
F1. the main thread creates an identification sub-thread and an identification buffer queue for each video channel, and the specific steps are as follows:
F1.1. the main thread receives a gesture recognition request sent from a forwarding server, and mounts the received image frames to a recognition buffer queue corresponding to the channel;
F1.2. the identification sub-thread corresponding to the channel acquires the image frames in the identification buffer queue and marks the image frames as an original picture S;
F2. the recognition sub-thread loads a deep neural network model composed of four cascaded network stages and extracts the feature information of the human joints in the original picture S; the specific steps are as follows:
F2.1. the first-stage network comprises two network paths N_11 and N_12 and generates the feature maps F^1_1 ~ F^1_14; the specific steps are as follows:
F2.1.1. the N_11 network extracts features from the original picture S using several residual modules and outputs the 14 feature maps F_{11-1} ~ F_{11-14}; the N_12 network first downsamples the original picture S, passes it through several residual modules, then upsamples, outputting the 14 feature maps F_{12-1} ~ F_{12-14}; each feature map corresponds to one joint point, located at its highest Gaussian response;
the 14 joint points corresponding to the 14 feature maps are: head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle;
F2.1.2. on the basis of the feature maps from F2.1.1, an encoder and a decoder are introduced to generate the weighted feature maps F'_{11-1} ~ F'_{11-14} and F'_{12-1} ~ F'_{12-14} with weights W; the specific steps are as follows:
F2.1.2.1. the encoder uniformly divides each feature map of F2.1.1 into several regions, called feature regions a; the code corresponding to each region is y;
the set of feature regions a is denoted R_D = {a_1, …, a_i, …, a_L}, where a_i is the i-th region, 1 <= i <= L, and L is the number of regions; a_i ∈ R_D; R_D is the complete feature map with D as the cutting unit, where D is the number of pixels per region; for a region size of 14 x 14, D = 196;
the code y is a 14-dimensional vector, 14 being the number of joints; the code of the i-th region is denoted y_i, and the j-th element of y_i is denoted y_ij: y_ij = 1 indicates that the i-th region contains the j-th joint point, y_ij = 0 that it does not; the set of codes is denoted Y_D = {y_1, …, y_i, …, y_L}, 1 <= i <= L;
F2.1.2.2. the decoder calculates the weights W_{11-1} ~ W_{11-14} and W_{12-1} ~ W_{12-14} for the feature maps of F2.1.1: each W is the set of the weights w of all feature regions in the feature map, denoted W = {w_1, …, w_i, …, w_L}; the weight w_i of each feature region represents the scale coefficient that region carries when it is input to the next stage of processing, and is calculated from the feature region a and the code y;
F2.1.3. fuse the feature maps F'_{11-1} ~ F'_{11-14} and F'_{12-1} ~ F'_{12-14} to obtain the output feature maps F^1_1 ~ F^1_14 of this stage;
F2.1.4. the original picture S and the feature maps F^1_1 ~ F^1_14 serve as the input to the next-stage network;
F2.2. the second-stage sub-network comprises two network paths N_21 and N_22; taking the feature maps F^1_1 ~ F^1_14 output by the first stage and the original picture S as input, the specific steps under F2.1 are repeated to obtain the outputs F^2_1 ~ F^2_14;
F2.3. the third-stage sub-network comprises two network paths N_31 and N_32; taking the feature maps F^2_1 ~ F^2_14 output by the second stage and the original picture S as input, the specific steps under F2.1 are repeated to obtain the outputs F^3_1 ~ F^3_14;
F2.4. the fourth-stage sub-network comprises two network paths N_41 and N_42; taking the feature maps F^3_1 ~ F^3_14 output by the third stage and the original picture S as input, the specific steps under F2.1 are repeated to obtain the outputs F^4_1 ~ F^4_14;
F2.5. using the feature maps F^4_1 ~ F^4_14, extract the name and coordinates of each joint point, connect adjacent joints according to physiological common sense, calculate the distances between joint points, and construct the skeleton diagram; the skeleton diagram comprises the names and coordinates of all joint points and the distances between them;
F3. the recognition sub-thread classifies the gesture of the skeleton map by using an SVM classifier, and the specific steps are as follows:
F3.1. loading the trained SVM classifier; the recognizable poses include but are not limited to kneeling, lying, sitting and standing;
F3.2. classifying the skeleton diagram, and returning the identification result to the forwarding server; the recognition result comprises a skeleton diagram and the gesture category and accuracy of the skeleton diagram;
G. the main thread of the forwarding server receives the identification result and forwards the identification result to the client, and the client displays the result on an interface; the method comprises the following specific steps:
G1. the main thread receives the recognition result and writes it into the latest forwarding data packet in the ring buffer queue of the corresponding channel;
G2. the forwarding sub-thread of the corresponding channel acquires the forwarding data packet from the ring buffer queue and sends it to the client;
G3. the client receives the forwarding data packet containing the identification result, extracts the video frame and the identification information, and displays the video frame and the identification information on the client interface.
CN201811299870.8A 2018-11-02 2018-11-02 Multichannel rapid human body posture recognition method for intelligent video monitoring Active CN109583315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811299870.8A CN109583315B (en) 2018-11-02 2018-11-02 Multichannel rapid human body posture recognition method for intelligent video monitoring

Publications (2)

Publication Number Publication Date
CN109583315A CN109583315A (en) 2019-04-05
CN109583315B true CN109583315B (en) 2023-05-12

Family

ID=65921471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811299870.8A Active CN109583315B (en) 2018-11-02 2018-11-02 Multichannel rapid human body posture recognition method for intelligent video monitoring

Country Status (1)

Country Link
CN (1) CN109583315B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860392B (en) * 2020-07-28 2021-04-20 珠海安联锐视科技股份有限公司 Thermodynamic diagram statistical method based on target detection and foreground detection
CN112017384A (en) * 2020-08-05 2020-12-01 山东大学 Automatic alarm method and system for real-time area monitoring
CN112287840B (en) * 2020-10-30 2022-07-22 焦点科技股份有限公司 Method and system for intelligently acquiring exercise capacity analysis data
CN112954449B (en) * 2021-01-29 2023-03-24 浙江大华技术股份有限公司 Video stream processing method, system, electronic device and storage medium
US11645874B2 (en) 2021-06-23 2023-05-09 International Business Machines Corporation Video action recognition and modification
CN113611387B (en) * 2021-07-30 2023-07-14 清华大学深圳国际研究生院 Motion quality assessment method based on human body pose estimation and terminal equipment
CN115187918B (en) * 2022-09-14 2022-12-13 中广核贝谷科技有限公司 Method and system for identifying moving object in monitoring video stream
CN115665359B (en) * 2022-10-09 2023-04-25 西华县环境监察大队 Intelligent compression method for environment monitoring data
CN116304179B (en) * 2023-05-19 2023-08-11 北京大学 Data processing system for acquiring target video
CN117115718B (en) * 2023-10-20 2024-01-09 思创数码科技股份有限公司 Government affair video data processing method, system and computer readable storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000115A1 (en) * 2015-06-29 2017-01-05 北京旷视科技有限公司 Person re-identification method and device
CN107527045A (en) * 2017-09-19 2017-12-29 桂林安维科技有限公司 A kind of human body behavior event real-time analysis method towards multi-channel video
CN107832672A (en) * 2017-10-12 2018-03-23 北京航空航天大学 A kind of pedestrian's recognition methods again that more loss functions are designed using attitude information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tracking and Recognition of Human Motion Postures Based on GMM (基于GMM的人体运动姿态的追踪与识别); Wei Yanxin et al.; Journal of Beijing Institute of Fashion Technology (Natural Science Edition); 2018-06-30 (No. 02); full text *

Also Published As

Publication number Publication date
CN109583315A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN109583315B (en) Multichannel rapid human body posture recognition method for intelligent video monitoring
Kim et al. Vision-based human activity recognition system using depth silhouettes: A smart home system for monitoring the residents
CN109543695B (en) Population-density population counting method based on multi-scale deep learning
Wang et al. Detection of abnormal visual events via global optical flow orientation histogram
US10140508B2 (en) Method and apparatus for annotating a video stream comprising a sequence of frames
CN103824070B (en) A kind of rapid pedestrian detection method based on computer vision
US20180114071A1 (en) Method for analysing media content
Wu et al. A detection system for human abnormal behavior
WO2011080900A1 (en) Moving object detection device and moving object detection method
Naik et al. Deep-violence: individual person violent activity detection in video
CN110852179B (en) Suspicious personnel invasion detection method based on video monitoring platform
Janku et al. Fire detection in video stream by using simple artificial neural network
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
Aldahoul et al. A comparison between various human detectors and CNN-based feature extractors for human activity recognition via aerial captured video sequences
CN110188718B (en) Unconstrained face recognition method based on key frame and joint sparse representation
Kale et al. Suspicious activity detection using transfer learning based resnet tracking from surveillance videos
CN114870384A (en) Taijiquan training method and system based on dynamic recognition
Dahirou et al. Motion Detection and Object Detection: Yolo (You Only Look Once)
Nosheen et al. Efficient Vehicle Detection and Tracking using Blob Detection and Kernelized Filter
Ponika et al. Developing a YOLO based Object Detection Application using OpenCV
IL260438A (en) System and method for use in object detection from video stream
CN117475353A (en) Video-based abnormal smoke identification method and system
Alzahrani et al. Anomaly detection in crowds by fusion of novel feature descriptors
Bhardwaj et al. Modified Neural Network-based Object Classification in Video Surveillance System.
Srinivasa et al. Fuzzy edge-symmetry features for improved intruder detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant