CN112084958A

CN112084958A - Method and device for recognizing human skeleton of multiple persons at mobile terminal

Info

Publication number: CN112084958A
Application number: CN202010952910.5A
Authority: CN
Inventors: 张德宇; 章晋睿; 许晓晖; 贾富程; 张尧学
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2020-12-15
Anticipated expiration: 2040-09-11
Also published as: CN112084958B

Abstract

The invention discloses a method and a device for multi-person human skeleton recognition at a mobile terminal. The method comprises the following steps: S1, decoding a video to obtain image frames; S2, identifying human body regions in the image frames to generate human body subgraphs; and S3, scheduling recognition of the human body subgraphs by allocating them to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal, the CPU and/or the GPU each running a preset skeleton recognition model to recognize the human skeletons. The invention has the advantages of high recognition speed, low delay, and high recognition accuracy.

Description

Method and device for recognizing human skeleton of multiple persons at mobile terminal

Technical Field

The invention relates to the field of mobile computing, in particular to a method and a device for recognizing human skeletons of multiple persons at a mobile terminal.

Background

As the performance of mobile devices improves, human pose estimation is being applied in more and more fields. One example is action recognition, which interprets human actions by tracking the changes in a person's pose over a period of time; it can be used to detect whether a person has fallen or is ill, and also for automated instruction in fitness, sports, and dance.

At present, human pose estimation models based on deep learning struggle to achieve satisfactory results on resource-limited mobile terminal devices. Although the latest models achieve high accuracy, they also bring a very large computational load; running them on a mobile device causes long computation delays and a very poor user experience. For example, the single-person pose estimation model disclosed in Sun, K., Xiao, B., Liu, D., and Wang, J., "Deep high-resolution representation learning for human pose estimation," arXiv preprint arXiv:1902.09212 (2019), achieves high accuracy on the MPII dataset, with a PCKh value (head-normalized Percentage of Correct Keypoints) of 92.3. However, when the model is run on a mobile device, typical CNN processing of one video frame takes 600 ms even with the support of the mobile GPU, so it is difficult to deploy on a mobile terminal with good results.

PoseNet, a pose recognition model proposed by Google for operation on mobile terminals, is obtained through deep learning model compression. In PoseNet the conventional convolution operation (CNN) is replaced by the depthwise separable convolution (Depthwise CNN), which requires less computation, and the depthwise separable convolutions are stacked layer by layer into a single-branch network structure. PoseNet runs a single-person pose recognition model on the Snapdragon 845 SoC, and the delay when running on the CPU is reduced to only 60 ms. However, this is only single-person pose estimation; for multi-person pose estimation the delay grows linearly with the number of people. The compressed model also loses considerable accuracy: for example, the PCKh value of PoseNet on the MPII dataset is only 80.3.
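The saving from depthwise separable convolution mentioned above can be illustrated with the usual multiply-accumulate (MAC) count. The following is a minimal sketch, assuming stride 1, "same" padding, and no bias; the function names and the example layer size are illustrative and are not taken from the cited models.

```python
def standard_conv_macs(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs of a standard k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs of a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k      # one k x k filter per input channel
    pointwise = h * w * c_in * c_out      # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 48 x 48 feature map, 32 input and 64 output channels.
    std = standard_conv_macs(48, 48, 32, 64)
    sep = depthwise_separable_macs(48, 48, 32, 64)
    print(std, sep, sep / std)
```

Under these assumptions the ratio of the two costs equals 1/c_out + 1/k², so with 3 x 3 kernels the separable form needs roughly one eighth to one ninth of the computation for typical channel counts.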

Therefore, how to ensure the recognition speed, reduce the delay and ensure the recognition accuracy in the human body skeleton recognition at the mobile end, especially in the human body skeleton recognition of multiple people, is still a problem to be solved urgently.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides a method and a device for recognizing human skeletons of multiple persons at a mobile terminal, which have the advantages of high recognition speed, low delay and high recognition precision.

In order to solve the technical problems, the technical scheme provided by the invention is as follows: a method for recognizing human skeleton of multiple persons at a mobile terminal comprises the following steps:

S1, decoding a video to obtain image frames;

S2, identifying human body regions in the image frames to generate human body subgraphs;

and S3, scheduling recognition of the human body subgraphs by allocating them to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal, the CPU and/or the GPU each running a preset skeleton recognition model to recognize the human skeletons.

Further, the image frame in the step S1 includes a key frame and a tracking frame;

the key frame is a first predetermined frame in the video, and the tracking frame is a second predetermined frame in the video;

or: the key frame is based on a preset reference point, if the displacement of the reference point in the current image frame relative to the reference point in the previous key frame is greater than a preset threshold value, the current image frame is the key frame, otherwise, the current image frame is the tracking frame.

Further, the step S2 specifically includes: aiming at the key frame, identifying a human body region through a fine-grained human body identification model, and generating a human body subgraph of the key frame; and for the tracking frame, identifying a human body region through a human body identification model based on a motion vector, and generating a human body subgraph of the tracking frame.

Further, the step S3 specifically includes:

when the number of human body subgraphs to be recognized satisfies h ≥ k + a, monitoring the idle-thread status of the CPU (central processing unit) and the GPU (graphics processing unit) of the mobile terminal, and, whenever an idle thread exists, allocating one human body subgraph to be recognized to that idle thread for human skeleton recognition, wherein h is the number of human body subgraphs to be recognized, k is a preset allocation parameter, and a is the maximum number of CPU threads of the mobile terminal.

Further, the step S3 specifically includes:

when the number of human body subgraphs to be recognized satisfies h < k + a, first allocating [formula missing in the source text] human body subgraphs to be recognized to the GPU; then judging the number of remaining human body subgraphs to be recognized, and, when that number satisfies [formula missing in the source text], monitoring the idle-thread status of the CPU of the mobile terminal and allocating one human body subgraph to be recognized to an idle thread when an idle thread exists, and otherwise allocating one human body subgraph to be recognized to more than one thread.

Further, the preset distribution parameter is determined according to the ratio of the CPU operation delay and the GPU operation delay of the mobile terminal.

Further, the preset skeleton recognition model in step S3 is a convolutional neural network model, and a neural network layer of the convolutional neural network model includes at least 1 inverted residual convolutional layer and at least 1 enhanced inverted residual convolutional layer; the inverted residual convolutional layer adopts a ReLU6 nonlinear activation function; the enhanced inverted residual convolutional layer adopts a Squeeze and Excitation module and an H-Swish activation function.

Further, the innermost neural network layer of the convolutional neural network model comprises 2 or more enhanced inverted residual convolutional layers.

Further, a preprocessing layer is arranged between the input layer of the convolutional neural network and the first neural network layer; the preprocessing layer comprises at least 1 first inverted residual module and at least 1 second inverted residual module; the strides of the first inverted residual module and the second inverted residual module are not equal.

A multi-person human body skeleton recognition device at a mobile terminal comprises a processor and a memory, wherein the processor is used for executing an application program in the memory, and the memory stores the application program which can realize the human body skeleton recognition method when being executed.

Compared with the prior art, the invention has the advantages that:

1. By scheduling the skeleton recognition tasks of the human body subgraphs to the CPU and/or the GPU, the invention makes effective use of the computing power of both the CPU and the GPU. This parallel processing strategy effectively increases the speed of human skeleton recognition and reduces its delay, especially when the skeletons of multiple persons are to be recognized.

2. In the convolutional neural network model, inverted residual convolutional layers and enhanced inverted residual convolutional layers are arranged in the neural network layers, so that the model achieves high accuracy and low delay while its computation amount is effectively reduced.

3. The method decomposes the video into the key frame and the tracking frame, adopts the human body recognition model with fine granularity to recognize the human body area for the key frame, adopts the human body recognition model based on the motion vector to recognize the human body area for the tracking frame, effectively utilizes the accuracy of the human body recognition model with fine granularity and the high efficiency of the human body recognition model based on the motion vector, can quickly and efficiently determine the human body area from continuous video frames, effectively improves the efficiency and the speed of framing the human body area, and reduces the calculation amount of human body area recognition.

Drawings

FIG. 1 is a schematic process flow diagram of an embodiment of the present invention.

FIG. 2 is a diagram illustrating human body joints and macro block division according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating joint point determination through consecutive motion frames according to an embodiment of the present invention.

FIG. 4 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating comparative analysis of the delay effect according to an embodiment of the present invention.

FIG. 6 is a graph of a comparison analysis of recognition accuracy for an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.

As shown in FIG. 1, the method for multi-person human skeleton recognition at a mobile terminal of the present embodiment includes the following steps: S1, decoding a video to obtain image frames; S2, identifying human body regions in the image frames to generate human body subgraphs; and S3, scheduling recognition of the human body subgraphs by allocating them to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal, the CPU and/or the GPU each running a preset skeleton recognition model to recognize the human skeletons.
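The overall flow of steps S1 to S3 can be outlined as follows. This is only a structural sketch under simplifying assumptions: frames are modeled as nested lists of pixel values, and `detect_people` and `schedule` are hypothetical callables standing in for the human body recognition model of step S2 and the CPU/GPU scheduler of step S3; none of these names come from the embodiment.

```python
from typing import Callable, Iterable, List, Tuple

Frame = List[List[int]]            # toy frame: rows of pixel values
Box = Tuple[int, int, int, int]    # (x0, y0, x1, y1), upper bounds exclusive


def crop(frame: Frame, box: Box) -> Frame:
    """Cut one human body subgraph out of a frame."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]


def recognize_video(
    frames: Iterable[Frame],
    detect_people: Callable[[Frame], List[Box]],
    schedule: Callable[[List[Frame]], list],
) -> List[list]:
    """Run S2 and S3 on every frame delivered by the decoder (S1)."""
    results = []
    for frame in frames:                                 # S1: decoded image frames
        boxes = detect_people(frame)                     # S2: locate human body regions
        subgraphs = [crop(frame, box) for box in boxes]  # S2: one subgraph per person
        results.append(schedule(subgraphs))              # S3: dispatch to CPU and/or GPU
    return results
```

Passing the detector and the scheduler in as callables keeps the outline independent of any particular model or runtime.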

In the present embodiment, the image frames in step S1 include key frames and tracking frames. Either the key frames are first preset frames in the video and the tracking frames are second preset frames in the video; or the key frames are determined from a preset reference point: if the displacement of the reference point in the current image frame relative to the reference point in the previous key frame is greater than a preset threshold, the current image frame is a key frame; otherwise, it is a tracking frame. In the first case, the first preset frames may be a specified start frame in the video together with the frames that follow it at a fixed step; for example, with a step of 10, frame 1, frame 11, and frame 21 are key frames and the remaining frames are tracking frames. In the second case, the 1st frame is taken as a key frame, and joint points of the human body, such as the wrist joints and ankle joints, are then used as reference points; when the displacement of a joint point accumulated over several consecutive frames exceeds the preset threshold, the next image frame is taken as a key frame, and the other frames are tracking frames. A large number of experiments with the human body region recognition scheme of this embodiment show that recognition accuracy degrades severely once the displacement of the reference point exceeds a certain distance; to ensure both recognition accuracy and recognition efficiency, the preset threshold is preferably a distance of 10 pixels.
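The two ways of selecting key frames can be sketched as follows. This is a minimal illustration, assuming 1-based frame numbers, a single tracked reference point per frame, and Euclidean distance as the displacement measure; the function names are hypothetical. In the displacement variant, the frame in which the threshold is exceeded is itself treated as the new key frame, which is one reading of the rule; taking the following frame instead would only shift each index by one.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def keyframes_by_step(num_frames: int, step: int = 10) -> List[int]:
    """First case: frame 1 and every `step`-th frame after it are key frames."""
    return list(range(1, num_frames + 1, step))


def keyframes_by_displacement(
    ref_points: Sequence[Point], threshold: float = 10.0
) -> List[int]:
    """Second case: ref_points[i] is the reference point (e.g. a wrist) in frame i + 1.

    A frame becomes a key frame when the reference point has moved more than
    `threshold` pixels away from its position in the previous key frame.
    """
    if not ref_points:
        return []
    keys = [1]                      # the 1st frame is always a key frame
    anchor = ref_points[0]          # reference point in the latest key frame
    for frame_no, point in enumerate(ref_points[1:], start=2):
        if math.dist(point, anchor) > threshold:
            keys.append(frame_no)
            anchor = point
    return keys


if __name__ == "__main__":
    print(keyframes_by_step(25, 10))
    track = [(0, 0), (3, 0), (8, 0), (11, 0), (12, 0), (22, 0)]
    print(keyframes_by_displacement(track))
```

All frames whose numbers are not returned are treated as tracking frames.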

In this embodiment, step S2 specifically includes: for a key frame, identifying the human body regions through a fine-grained human body recognition model to generate the human body subgraphs of the key frame; and for a tracking frame, identifying the human body regions through a motion-vector-based human body recognition model to generate the human body subgraphs of the tracking frame. The fine-grained human body recognition model of this embodiment has higher recognition accuracy than the motion-vector-based human body recognition model and is preferably a MobileNet-V3-SSDLite model, which can accurately locate the human body position regions in an image frame to generate the human body subgraphs. Of course, other models may also be used as the fine-grained human body recognition model. After the human body position regions in a key frame have been accurately located by the fine-grained model, the positions of the human bodies in the tracking frames following the key frame can be tracked through the motion vectors (Motion Vector) produced by the displacement of the human bodies between consecutive image frames, so that the human body position regions in the tracking frames are located quickly and the human body subgraphs of the tracking frames are generated. In the motion-vector-based human body recognition model, the human skeleton joint points, such as the wrist joint and the ankle joint shown in FIGS. 2 and 3, are taken as centers, and the motion vectors of the macroblocks around each center are selected to predict the motion of the joint. This minimizes the adverse effect of erroneous macroblock motion vectors and matches the fact that, during human motion, the trunk often stays still while some limbs, such as the palm and the arm, move around the joint points. As shown in FIG. 2, taking the left wrist as an example, four boxes are selected at the upper left, upper right, lower left, and lower right of the identified left wrist joint point, which together form the region of interest of that joint point. The motion vectors within this region are then computed, the motion vectors sharing the most common direction are counted as effective motion vectors, and the moving direction and displacement of the joint point are obtained from the direction and magnitude of the effective motion vectors. The motion vectors can be obtained through the hardware encoding interface of the Android system of a mobile phone, whose underlying layer may compute them with a digital signal processing device; a motion vector describes the displacement of each macroblock (Macroblock) between two frames.
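The selection of effective motion vectors around a joint point can be sketched as follows. Several details are assumptions made only for illustration: macroblocks are 16 x 16 pixels, the region of interest is the four macroblocks meeting at the block corner nearest to the joint, directions are quantized into 8 sectors, zero vectors are ignored, and the joint displacement is the mean of the effective vectors. The embodiment does not fix any of these choices, and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vec = Tuple[float, float]
Block = Tuple[int, int]          # macroblock index (column, row)
MB_SIZE = 16                     # assumed macroblock size in pixels


def roi_blocks(joint: Vec, mb: int = MB_SIZE) -> List[Block]:
    """Upper-left, upper-right, lower-left and lower-right blocks around the joint."""
    cx, cy = round(joint[0] / mb), round(joint[1] / mb)
    return [(cx - 1, cy - 1), (cx, cy - 1), (cx - 1, cy), (cx, cy)]


def direction_bin(v: Vec, bins: int = 8) -> int:
    """Quantize the direction of a vector into one of `bins` sectors."""
    angle = math.atan2(v[1], v[0]) % (2 * math.pi)
    return int(round(angle / (2 * math.pi / bins))) % bins


def joint_displacement(joint: Vec, mvs: Dict[Block, Vec], mb: int = MB_SIZE) -> Vec:
    """Mean of the motion vectors that share the most common direction in the ROI."""
    vectors = [mvs.get(block, (0.0, 0.0)) for block in roi_blocks(joint, mb)]
    moving = [v for v in vectors if v != (0.0, 0.0)]
    if not moving:
        return (0.0, 0.0)
    dominant = Counter(direction_bin(v) for v in moving).most_common(1)[0][0]
    effective = [v for v in moving if direction_bin(v) == dominant]
    n = len(effective)
    return (sum(v[0] for v in effective) / n, sum(v[1] for v in effective) / n)


def track_joint(joint: Vec, mvs: Dict[Block, Vec]) -> Vec:
    """Move the joint by the displacement estimated from its region of interest."""
    dx, dy = joint_displacement(joint, mvs)
    return (joint[0] + dx, joint[1] + dy)
```

For example, if two of the four blocks report motion to the right and one reports motion in another direction, only the two agreeing vectors contribute to the estimated displacement, which limits the influence of a single erroneous macroblock.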

In the present embodiment, the joint points of the skeleton are determined by motion vectors and an adaptive regression algorithm. FIG. 3 shows three consecutive frames in which the left ankle moves forward. In FIG. 3, the light dots and arrows are the motion vectors of the objects in the picture: the motion vector of a region without motion appears as a dot and has a value of 0, whereas the ankle region moves, so its motion vectors appear as arrows whose direction is the moving direction of the ankle and whose length is the displacement of the ankle. In the panel of FIG. 3 corresponding to time t, an enlarged view of the ankle is used for explanation: the dot below the middle is the joint point identified by the skeleton recognition model in the previous frame, which serves as the reference for predicting the joint point from the motion vectors of the next frame; the dot on the left side of the ankle is the joint point predicted by an autoregressive filter from the joint point position and the motion vector displacement of the previous frame; and the dot on the right side is the joint point position actually obtained from the motion vector displacement of the current frame. In the figure these are drawn as white, outlined, and black-filled dots. Finally, the person is framed using the joint points obtained.
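Framing a person from the tracked joint points can be sketched as a bounding box around those points. The relative margin added around the joints and the clipping to the frame size are assumptions made for illustration, since joint points lie inside the body outline and the embodiment does not state how much padding is used; the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels


def person_box(
    joints: Sequence[Point], width: int, height: int, margin: float = 0.1
) -> Box:
    """Bounding box of the joint points, padded by `margin` and clipped to the frame."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    pad_x = (x1 - x0) * margin    # horizontal padding relative to box width
    pad_y = (y1 - y0) * margin    # vertical padding relative to box height
    return (
        max(0, math.floor(x0 - pad_x)),
        max(0, math.floor(y0 - pad_y)),
        min(width, math.ceil(x1 + pad_x)),
        min(height, math.ceil(y1 + pad_y)),
    )
```

The resulting box can then be cropped from the tracking frame to form the human body subgraph that is passed on to skeleton recognition.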

In this embodiment, it is further preferable to calibrate the positions of the tracked skeleton points with a diffusion Kalman filter, so as to further improve the recognition accuracy. In this way, the high accuracy of the fine-grained human body recognition model and the high efficiency of the motion-vector-based human body recognition model are both put to effective use: the human body position regions in each of the consecutive frames can be marked quickly and accurately to generate the human body subgraphs, the total computation amount is effectively reduced, and efficiency and accuracy are greatly improved.

In this embodiment, step S3 specifically includes: when the number of human body subgraphs to be recognized satisfies h ≥ k + a, monitoring the idle-thread status of the CPU (central processing unit) and the GPU (graphics processing unit) of the mobile terminal, and, whenever an idle thread exists, allocating one human body subgraph to be recognized to that idle thread for human skeleton recognition, wherein h is the number of human body subgraphs to be recognized, k is a preset allocation parameter, and a is the maximum number of CPU threads of the mobile terminal.

In this embodiment, the number of CPU threads is determined by the CPU cores used for skeleton recognition in the present technical solution. For example, on an existing mobile phone with 4 big cores and 4 small cores, when only the 4 big cores are used for skeleton recognition, the maximum number of CPU threads is 4; for a CPU with cores of different sizes, when all cores are used for skeleton recognition, the maximum number of CPU threads is the number of cores of the CPU.
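The idle-thread dispatch used when h ≥ k + a can be illustrated with a small event-driven simulation. This is a sketch under stated assumptions, not the scheduler of the embodiment: the GPU is modeled as a single worker, each of the a CPU threads and the GPU has a fixed per-subgraph latency, and the next subgraph always goes to whichever worker becomes idle first. The function name and the latency values in the example are hypothetical.

```python
import heapq
from typing import List, Tuple


def greedy_dispatch(
    h: int, a: int, cpu_ms: float, gpu_ms: float
) -> Tuple[List[str], float]:
    """Assign h subgraphs to a CPU threads and one GPU worker, idle worker first.

    Returns the worker chosen for each subgraph and the total completion time.
    """
    # Heap entries: (time at which the worker becomes idle, name, latency per subgraph)
    workers = [(0.0, "gpu", gpu_ms)] + [(0.0, f"cpu{i}", cpu_ms) for i in range(a)]
    heapq.heapify(workers)
    assignment: List[str] = []
    makespan = 0.0
    for _ in range(h):
        idle_at, name, latency = heapq.heappop(workers)   # earliest idle worker
        done = idle_at + latency
        assignment.append(name)
        makespan = max(makespan, done)
        heapq.heappush(workers, (done, name, latency))
    return assignment, makespan


if __name__ == "__main__":
    # Hypothetical latencies: 60 ms per subgraph on a CPU thread, 20 ms on the GPU.
    assignment, total_ms = greedy_dispatch(h=5, a=2, cpu_ms=60.0, gpu_ms=20.0)
    print(assignment, total_ms)
```

With these example values each CPU thread handles one subgraph while the GPU handles three, and all five subgraphs finish after 60 ms, compared with 100 ms on the GPU alone or 180 ms on the two CPU threads alone.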

In this embodiment, step S3 specifically includes: when the number of human body subgraphs to be recognized satisfies h < k + a, first allocating [formula missing in the source text] human body subgraphs to be recognized to the GPU; then judging the number of remaining human body subgraphs to be recognized, and, when that number satisfies [formula missing in the source text], monitoring the idle-thread status of the CPU of the mobile terminal and allocating one human body subgraph to be recognized to an idle thread when an idle thread exists, or otherwise allocating one human body subgraph to be recognized to more than one thread.

In this embodiment, the preset allocation parameter is determined according to the ratio of the CPU operation delay to the GPU operation delay of the mobile terminal. Preferably, the CPU operation delay is the time a single CPU thread needs to perform human skeleton recognition on a single human body subgraph, and the GPU operation delay is the time the GPU needs to perform human skeleton recognition on a single human body subgraph. Of course, the preset allocation parameter may also be determined directly from other quantities related to the operation delay, such as the ratio of the computation speed of the CPU to that of the GPU.
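The allocation parameter and the resulting branch decision can be sketched as follows, assuming that k is taken directly as the unrounded ratio of the per-subgraph CPU delay to the per-subgraph GPU delay; whether the embodiment rounds this value is not stated. The function names and example delays are hypothetical.

```python
def allocation_parameter(cpu_ms: float, gpu_ms: float) -> float:
    """k: per-subgraph delay of one CPU thread divided by that of the GPU."""
    if cpu_ms <= 0 or gpu_ms <= 0:
        raise ValueError("delays must be positive")
    return cpu_ms / gpu_ms


def use_idle_thread_dispatch(h: int, k: float, a: int) -> bool:
    """True when h >= k + a, i.e. when the idle-thread branch of step S3 applies."""
    return h >= k + a


if __name__ == "__main__":
    k = allocation_parameter(cpu_ms=60.0, gpu_ms=20.0)   # hypothetical delays
    print(k, use_idle_thread_dispatch(h=5, k=k, a=4))
```

One way to read the threshold is that the GPU completes about k subgraphs in the time one CPU thread completes one, so k + a approximates the number of subgraphs that all workers together finish in one CPU delay period.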

In this embodiment, the skeleton recognition model preset in step S3 is a convolutional neural network model, and a neural network layer of the convolutional neural network model includes at least 1 inverted residual convolutional layer and at least 1 enhanced inverted residual convolutional layer; the inverted residual convolutional layer adopts a ReLU6 nonlinear activation function; the enhanced inverted residual convolutional layer adopts a Squeeze and Excitation module and an H-Swish activation function. Preferably, the innermost neural network layer of the convolutional neural network model comprises 2 or more enhanced inverted residual convolutional layers. Further preferably, a preprocessing layer is arranged between the input layer of the convolutional neural network and the first neural network layer; the preprocessing layer comprises at least 1 first inverted residual module and at least 1 second inverted residual module; the strides of the first inverted residual module and the second inverted residual module are not equal.
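The two activation functions named above have widely used closed forms, ReLU6(x) = min(max(x, 0), 6) and H-Swish(x) = x · ReLU6(x + 3) / 6. The sketch below evaluates them on scalars; the function names are illustrative, and the formulas are the commonly used definitions rather than text quoted from this embodiment.

```python
def relu6(x: float) -> float:
    """ReLU clipped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Hard swish: x * ReLU6(x + 3) / 6, a piecewise approximation of x * sigmoid(x)."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    for value in (-4.0, -1.0, 0.0, 1.0, 4.0, 8.0):
        print(value, relu6(value), h_swish(value))
```

Both functions use only comparisons, additions, and multiplications, which keeps them inexpensive on mobile CPUs and GPUs.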

In this embodiment, as shown in FIG. 4, the input of the convolutional neural network model is a 192 × 192 × 3 picture, and the output is a 48 × 48 × 3 human keypoint heatmap. The overall network structure follows the Hourglass and U-Net network structures: the picture is first reduced from high resolution to low resolution, and after features have been extracted the low resolution is restored to high resolution. A network structure in which the resolution first decreases and then increases can capture information at the different scales of the image as far as possible, which effectively improves the recognition accuracy of the model. In the convolutional neural network model, the input picture is reduced from high resolution to very low resolution through convolution and max pooling operations; the network then gradually restores the low resolution to high resolution through bilinear interpolation, and the features at the different scales are fused to obtain the final output. The features of the human skeleton points are extracted using the inverted residual convolution as the basic convolution operation. An inverted residual convolution module is formed by connecting in series a 1x1 convolution with a ReLU6 nonlinear activation function, a 3x3 depthwise convolution, and a 1x1 convolution with a linear activation function, and its computation amount is greatly reduced compared with the traditional CNN convolution. However, because both the number of parameters and the computation amount of the inverted residual convolution are reduced, its feature extraction accuracy is reduced as well, which may lower the accuracy of the whole human pose recognition model.
To give the model both high accuracy and low delay, the innermost layer of the convolutional neural network model of this embodiment adopts the enhanced inverted residual convolution, that is, a Squeeze and Excitation module is combined into the inverted residual operation and ReLU6 is replaced with the H-Swish activation function. Owing to the structure of the human body, some parts, such as the ankles, are difficult to recognize, so the feature extraction capability needs to be further strengthened for these skeleton points; using the enhanced inverted residual convolution in the innermost layer of the network allows the feature information of the image at low resolution to be extracted. In this embodiment, a Layer Module (n, e) is defined as a feature extraction module of the model network, where n represents the total number of convolution operations in that layer of the network, counting both inverted residual convolutions and enhanced inverted residual convolutions, and e represents the number of enhanced inverted residual convolutions among them. In the high-resolution part of the network, Layer Module (4, 1) is adopted for feature extraction, because joint points such as the head and the neck are easy to recognize and their features are relatively distinct. The low-resolution part of the network mainly recognizes joint points such as the wrists and the ankles, whose features are harder to extract, so Layer Module (4, 2) is adopted there to improve the feature extraction capability of the model. In this way, the recognition accuracy of the model for each joint point of the human body is effectively improved while the computation amount of the model is reduced.
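The Squeeze and Excitation step and the Layer Module (n, e) notation can be sketched in plain Python. The sketch makes several assumptions that the embodiment does not state: the gate uses a sigmoid as in the original Squeeze-and-Excitation design, the two fully connected layers have no bias, the enhanced blocks are placed last within a Layer Module, and the weights in any example are arbitrary. All names are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]   # indexed as [channel][row][col]


def squeeze(fm: FeatureMap) -> List[float]:
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fm]


def dense(weights: List[List[float]], vec: List[float]) -> List[float]:
    """Bias-free fully connected layer: weights has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def squeeze_excite(
    fm: FeatureMap, w_reduce: List[List[float]], w_expand: List[List[float]]
) -> FeatureMap:
    """Rescale each channel by a gate computed from the pooled channel statistics."""
    pooled = squeeze(fm)
    hidden = [max(0.0, v) for v in dense(w_reduce, pooled)]               # ReLU
    gates = [1.0 / (1.0 + math.exp(-v)) for v in dense(w_expand, hidden)]  # sigmoid
    return [[[g * x for x in row] for row in ch] for ch, g in zip(fm, gates)]


def layer_module(n: int, e: int) -> List[str]:
    """Block types of Layer Module (n, e): n blocks in total, e of them enhanced."""
    if not 0 <= e <= n:
        raise ValueError("e must satisfy 0 <= e <= n")
    # Assumption: enhanced blocks come last; the embodiment does not fix the order.
    return ["inverted_residual"] * (n - e) + ["enhanced_inverted_residual"] * e


if __name__ == "__main__":
    print(layer_module(4, 1))
    print(layer_module(4, 2))
```

The gate lets the block emphasize informative channels at the cost of two small fully connected layers, which is why it is reserved here for the parts of the network where features are hardest to extract.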

The multi-person human body skeleton recognition device of the mobile terminal in the embodiment comprises a processor and a memory, wherein the processor is used for executing an application program in the memory, and the memory stores the application program which can realize the human body skeleton recognition method when being executed.

In this embodiment, the following conclusions can be drawn by running the technical solution of this embodiment on different models of mobile devices and comparing it with the prior art. As shown in FIG. 5, compared with solutions that run only on the CPU or only on the GPU, the technical solution of this embodiment clearly reduces the delay on mobile platforms such as the vivo iQOO, the Xiaomi 8, and the HiKey970, and clearly improves recognition efficiency. Meanwhile, as shown in FIG. 6, the accuracy of the scheme of the present embodiment is substantially equivalent to that of the comparison solutions, and some indexes are improved.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to methods, apparatus (systems), and computer program products according to embodiments of the application, wherein instructions executed by a processor of a computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. A method for recognizing human skeleton of multiple persons at a mobile terminal is characterized by comprising the following steps:

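S1, decoding a video to obtain image frames;

S2, identifying human body regions in the image frames to generate human body subgraphs;

and S3, scheduling recognition of the human body subgraphs by allocating them to a CPU (central processing unit) and/or a GPU (graphics processing unit) of the mobile terminal, the CPU and/or the GPU each running a preset skeleton recognition model to recognize the human skeletons.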

2. The method for multi-person human skeleton recognition at a mobile end according to claim 1, wherein: the image frame in the step S1 includes a key frame and a tracking frame;

3. The method for multi-person human skeleton recognition at a mobile end according to claim 2, wherein: the step S2 specifically includes: aiming at the key frame, identifying a human body region through a fine-grained human body identification model, and generating a human body subgraph of the key frame; and for the tracking frame, identifying a human body region through a human body identification model based on a motion vector, and generating a human body subgraph of the tracking frame.

4. The mobile-end multi-person human skeleton recognition method according to any one of claims 1 to 3, wherein: the step S3 specifically includes:

5. The method for multi-person human skeleton recognition at the mobile end according to claim 4, wherein: the step S3 specifically includes:

Distributing the human body subgraphs to be identified to a GPU;

6. The method for multi-person human skeleton recognition at a mobile end according to claim 5, wherein: and the preset distribution parameters are determined according to the ratio of the CPU operation delay and the GPU operation delay of the mobile terminal.

7. The mobile-end multi-person human skeleton recognition method according to any one of claims 1 to 3, wherein: the preset skeleton recognition model in the step S3 is a convolutional neural network model, and a neural network layer of the convolutional neural network model includes at least 1 inverted residual convolutional layer and at least 1 enhanced inverted residual convolutional layer; the inverted residual convolutional layer adopts a ReLU6 nonlinear activation function; the enhanced inverted residual convolutional layer adopts a Squeeze and Excitation module and an H-Swish activation function.

8. The method for multi-person human skeleton recognition at a mobile end according to claim 7, wherein: the innermost neural network layer of the convolutional neural network model comprises 2 or more enhanced inverted residual convolutional layers.

9. The method for multi-person human skeleton recognition at a mobile end according to claim 8, wherein: a preprocessing layer is also arranged between the input layer of the convolutional neural network and the first neural network layer; the preprocessing layer comprises at least 1 first inverted residual module and at least 1 second inverted residual module; the strides of the first inverted residual module and the second inverted residual module are not equal.

10. A multi-person human skeleton recognition device at a mobile terminal comprises a processor and a memory, wherein the processor is used for executing an application program in the memory, and the device is characterized in that: the memory stores an application program which can realize the multi-person human skeleton recognition method of the mobile terminal according to any one of claims 1 to 9 when executed.