CN111191622A - Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium - Google Patents

Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium

Info

Publication number
CN111191622A
CN111191622A (application CN202010006031.3A)
Authority
CN
China
Prior art keywords
key points
thermodynamic diagram
target image
offset vector
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010006031.3A
Other languages
Chinese (zh)
Other versions
CN111191622B (en)
Inventor
肖菁
李海超
屈光卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN202010006031.3A priority Critical patent/CN111191622B/en
Publication of CN111191622A publication Critical patent/CN111191622A/en
Application granted granted Critical
Publication of CN111191622B publication Critical patent/CN111191622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a system and a storage medium for gesture recognition based on thermodynamic diagrams and offset vectors, wherein the method comprises the following steps: acquiring a target image to be identified; extracting features of the target image to be recognized; predicting the positions of key points according to the extracted features; correcting the predicted key points and determining the final positions of the key points; and determining the attitude information of the target to be recognized according to the key points. By extracting image features, predicting the positions of the key points and correcting the prediction results, the invention finally recognizes and obtains the attitude information.

Description

Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
Technical Field
The invention relates to the technical field of deep learning, in particular to a method, a system and a storage medium for gesture recognition based on thermodynamic diagrams and offset vectors.
Background
Thermodynamic diagram (heat map): a probability map in which the probability of a pixel closer to the center point is closer to 1 and the probability of a pixel farther from the center point is closer to 0; the map can be modeled by a corresponding function, such as a Gaussian.
Offset vector: the displacement of a point relative to a reference point, derived from the distance between the point and the reference point.
Pose estimation: the task of determining the pose of an object in an image (or stereo image, or image sequence); here specifically, reconstructing the joints and limbs of a person.
People routinely record their lives by taking photos. To better understand the people in those photos, we want to locate where each person is and know what activity they are performing; achieving these goals is the central problem of human pose estimation. Pose estimation, also known as human keypoint detection, primarily identifies the locations of key parts of the human body, such as the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. Despite years of research, it remains a very challenging problem in computer vision; the difficulties mainly come from complex backgrounds in natural scenes, blurring, shadows, illumination, and the colors of clothing. Furthermore, interaction between people's limbs causes strong interference, such as overlapping limbs and occlusion between limbs.
Because actual application scenes often contain more than one person, current pose estimation algorithms are mainly multi-person pose algorithms. Multi-person pose estimation has two main lines of work: top-down methods and bottom-up methods. Top-down methods first use an object detection method, such as Faster R-CNN (Faster Region-based Convolutional Neural Networks) or SSD (Single Shot MultiBox Detector), to obtain the detection boxes of the multiple people in the image, then crop them from the original image and pass them to a subsequent pose estimation network, which predicts the human key points separately for each cropped image. The top-down approach thus converts the multi-person pose estimation problem into single-person pose estimation. Bottom-up multi-person pose estimation first detects the key points of all people and then clusters the key points, connecting the different key points belonging to each person together, so that the different individuals are produced by clustering. Bottom-up methods therefore focus on exploring key point clustering, that is, how to construct the relationship between different key points.
With the rapid development of deep learning in computer vision, a large amount of research applying deep learning to human key point detection has emerged in recent years. However, most existing work focuses on how to design the data transmission paths inside the network so as to obtain rich spatial and detail information from the picture, for example Feature Pyramid Networks (Feature Pyramid Networks for Object Detection), cascaded pyramid networks (Cascaded Pyramid Network for Multi-Person Pose Estimation), and stacked hourglass networks (Stacked Hourglass Networks for Human Pose Estimation). These methods improve the accuracy of human key point detection, but they neglect the small offset that occurs when the predicted point is mapped from low resolution back to high resolution, which causes a certain loss of precision.
Disclosure of Invention
In view of this, embodiments of the present invention provide a high-accuracy method, system, and storage medium for gesture recognition based on a thermodynamic diagram and an offset vector.
The invention provides a gesture recognition method based on thermodynamic diagrams and offset vectors, which comprises the following steps:
acquiring a target image to be identified;
extracting the characteristics of the target image to be recognized;
predicting the positions of key points according to the extracted features;
correcting the predicted key points and determining the final positions of the key points; and
and determining the attitude information of the target to be recognized according to the key points.
Further, the step of performing feature extraction on the target image to be recognized includes:
cutting the obtained target image to be recognized;
inputting each image obtained by cutting into a residual error network; and
and carrying out coding processing through the residual error network to obtain a first characteristic diagram.
Further, the residual error network comprises five convolutional layers;
in addition, the step of obtaining the feature map by performing the encoding process through the residual error network includes the steps of:
carrying out dimension-changing processing on each channel of the feature map through convolution kernels, wherein the dimension-changing processing comprises dimension increase and dimension reduction;
carrying out normalization processing on each channel; and
and carrying out nonlinear activation processing on the result after the normalization processing.
Further, the step of extracting the features of the target image to be recognized further includes a decoding step, and the decoding step includes:
inputting the obtained first feature map into a deconvolution structure;
decoding the first feature map by a deconvolution structure; and
and acquiring a characteristic response graph of each channel.
Further, the predicting the location of the keypoint based on the extracted features comprises:
acquiring thermodynamic diagrams from output results of the channels;
calculating the maximum value of each thermodynamic diagram to obtain the position information of each key point on the thermodynamic diagram; and
and mapping the position information of the key points to the target image to be recognized according to the size relation between the target image to be recognized and the thermodynamic diagram.
Further, the step of correcting the predicted key points and determining the final positions of the key points includes the following steps:
determining the offset vector of the key point according to the output result of each channel; and
and adding the offset vector to the position of the maximum value of the thermodynamic diagram to determine the final position of the key point.
Further, the method also comprises the following steps:
training a thermodynamic diagram by adopting a mean square error loss function; and
in training the offset vector, a smooth penalty function is used to handle the gap between the true offset and the predicted offset.
The invention provides in a second aspect a system for gesture recognition based on thermodynamic diagrams and offset vectors, comprising:
the acquisition module is used for acquiring a target image to be identified;
the characteristic extraction module is used for extracting the characteristics of the target image to be identified;
the key point prediction module is used for predicting the position of a key point according to the extracted features;
the key point correction module is used for correcting the predicted key points and determining the final positions of the key points; and
and the gesture determining module is used for determining gesture information of the target to be recognized according to the key points.
A third aspect of the invention provides a system for gesture recognition based on thermodynamic diagrams and offset vectors, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method.
A fourth aspect of the invention provides a storage medium having stored therein processor-executable instructions for performing the method when executed by a processor.
One or more of the above-described embodiments of the present invention have the following advantages: according to the embodiment of the invention, the characteristics of the image are extracted, the positions of the key points are predicted, the prediction result can be corrected, and finally the attitude information is obtained through recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flowchart illustrating the overall steps of an embodiment of the present invention;
FIG. 2 is a first exemplary flow chart of an embodiment of the present invention;
FIG. 3 is a second exemplary flow chart of an embodiment of the present invention;
FIG. 4 is a schematic diagram of using coordinate offsets to correct the coordinate locations predicted by the thermodynamic diagram according to an embodiment of the present invention;
FIG. 5 is a comparison of the results of various algorithms on the MSCOCO data set according to embodiments of the present invention;
FIG. 6 is a comparison of various algorithms on an MPII data set according to an embodiment of the present invention;
FIG. 7 is a comparison of results of various algorithms of embodiments of the present invention on the CROWDPOSE data set;
FIG. 8 shows the results of the detection of HOPE on the MSCOCO data set according to the embodiment of the present invention;
FIG. 9 shows the result of the detection of the HOPE on the MPII data set according to the embodiment of the present invention;
fig. 10 shows the detection result of the HOPE on the CROWDPOSE data set according to the embodiment of the invention.
Detailed Description
The invention will be further explained and illustrated with reference to the drawings and the embodiments in the description. The step numbers in the embodiments of the present invention are set for convenience of illustration only; the order of the steps is not limited, and the execution order of each step in the embodiments can be adjusted according to the understanding of those skilled in the art.
Most prior art innovates only on the network architecture of thermodynamic-diagram-based methods and focuses mainly on the loss function. However, thermodynamic-diagram-based methods involve a coordinate mapping process, and the prior art ignores the loss incurred when the predicted coordinates obtained from the low-resolution thermodynamic diagram are mapped back to the original image, which limits further improvement in accuracy.
Therefore, the application provides a human body pose estimation method based on thermodynamic diagrams and coordinate offsets: features are extracted by a robust convolutional neural network to predict the thermodynamic diagrams and the offset vectors of the key points, the thermodynamic diagrams are used to predict the key point coordinates, and the offset vectors are used to correct those coordinates so as to obtain more accurate position information.
Referring to fig. 1, the specific implementation steps of the embodiment of the present application include:
S1: acquiring a target image to be identified;
S2: extracting the characteristics of the target image to be recognized;
As shown in fig. 2 and fig. 3, feature extraction in the embodiment of the present application converts a picture into features. The network structure of the model is mainly divided into two parts: an encoding module and a decoding module. The encoding module of the present application adopts a 50-layer residual network with the last 1x1 convolutional layer removed; this module extracts features from the input image in a fully convolutional manner. In particular, the residual design makes the encoding module perform very well in many computer vision tasks and gives it strong feature expression capability.
The residual network of this embodiment is composed of five groups of convolutional layers, c1, c2, c3, c4 and c5, and each group contains N residual modules. A residual module consists of alternating convolutional layers, BN layers and ReLU activations. The 1x1 convolution kernels are mainly used to reduce or increase the channel dimension of the feature map; reducing the dimension with a 1x1 convolution before the 3x3 convolution effectively lowers the computation of the following kernel. The BN layer is a batch normalization layer; each channel has four corresponding parameters, namely a mean, a variance, a scaling coefficient and an offset, which are used to normalize the features fed into the BN layer, alleviating the problem that the data distribution of intermediate layers shifts during training and causes vanishing or exploding gradients. The ReLU serves as the nonlinear activation function: on one hand it improves the nonlinear expression capability of the network, and on the other hand it avoids the slow parameter updates that the Sigmoid function suffers in its saturation region. As shown in fig. 2, the encoding step of the embodiment of the present application is implemented as follows: first, the acquired image is cropped; second, the image is input into the residual network; and third, the encoded feature map is obtained from the residual network.
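For illustration only, the following is a minimal PyTorch-style sketch of a bottleneck residual module of the kind described above (1x1 dimension reduction, 3x3 convolution, 1x1 dimension restoration, with BN and ReLU); the class name, channel arguments and shortcut handling are assumptions for this sketch, not the patent's implementation.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual module: 1x1 reduce -> 3x3 -> 1x1 restore, with BN and ReLU."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),              # 1x1: reduce channels before the 3x3
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride, 1, bias=False),   # 3x3 convolution
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),              # 1x1: restore the channel dimension
            nn.BatchNorm2d(out_ch),
        )
        # projection shortcut when the shape changes, identity otherwise
        self.shortcut = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                       nn.BatchNorm2d(out_ch))
                         if stride != 1 or in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```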
In addition, the embodiment of the present application further includes a decoding step. As shown in fig. 2, the decoding step comprises: first, inputting the obtained feature map into a deconvolution structure; second, decoding the feature map with the deconvolution structure; and third, obtaining a feature response map with 3×n channels from a 1x1 convolution. As shown in fig. 3, the network outputs thermodynamic diagrams and offset vectors respectively: the thermodynamic diagrams correspond to n channels and are used to predict the positions of the n key points, and the offset vectors correspond to 2×n channels and are used to predict the offset of the key point at each position in the x and y directions. The size of the final feature map at the end of the network is 64×48, one quarter of the input image in both width and height.
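As an illustrative sketch of the decoding head described above (deconvolution upsampling followed by a 1x1 convolution that emits 3×n channels, i.e. n thermodynamic-diagram channels and 2×n offset channels); the layer count, channel widths and names here are assumed values, not the patented configuration.

```python
import torch.nn as nn

class HeatmapOffsetHead(nn.Module):
    """Deconvolution decoder: upsample the encoder features, then a 1x1 convolution
    emits 3*n channels (n thermodynamic diagrams + 2*n offset maps)."""
    def __init__(self, in_ch=2048, n_keypoints=17, deconv_ch=256):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(3):   # three stride-2 deconvolutions: 8x upsampling, e.g. 8x6 -> 64x48
            layers += [nn.ConvTranspose2d(ch, deconv_ch, 4, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(deconv_ch), nn.ReLU(inplace=True)]
            ch = deconv_ch
        self.deconv = nn.Sequential(*layers)
        self.out = nn.Conv2d(deconv_ch, 3 * n_keypoints, kernel_size=1)

    def forward(self, features):
        y = self.out(self.deconv(features))
        n = y.shape[1] // 3
        return y[:, :n], y[:, n:]   # thermodynamic diagrams (B, n, h, w), offsets (B, 2n, h, w)
```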
S3: predicting the positions of key points according to the extracted features;
Specifically, the embodiment of the present application assumes that the location of the k-th key point is l_k. If the distance between a position x_i on the thermodynamic diagram and the key point l_k does not exceed the radius R, the probability that the position is the real key point follows a Gaussian distribution, which is more favorable for network learning; that is, h_k(x_i) = G(x_i − l_k) if ||x_i − l_k|| ≤ R, and h_k(x_i) = 0 otherwise, where G denotes a Gaussian function. Clearly, the closer a position on the thermodynamic diagram is to the key point l_k, the greater the probability that it is the key point. The implementation of key point prediction comprises the following steps: first, obtaining the thermodynamic diagrams from the output channels of the network; second, since each key point l_k corresponds to a thermodynamic diagram h_k, locating the position of each key point on its thermodynamic diagram; and third, mapping the coordinates from the thermodynamic diagram to the input image according to the size ratio between the input image and the thermodynamic diagram.
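A minimal sketch, under the assumptions of a stride-4 thermodynamic diagram and illustrative values of R and the Gaussian width, of how the Gaussian target h_k and the argmax-plus-mapping prediction described above could be computed (the function names are hypothetical):

```python
import numpy as np

def gaussian_heatmap(h, w, center, radius=3.0, sigma=1.5):
    """Target thermodynamic diagram for one key point l_k given as center = (cx, cy)
    in heatmap coordinates: Gaussian inside radius R, zero elsewhere."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    hm = np.exp(-d2 / (2.0 * sigma ** 2))      # h_k(x_i) = G(x_i - l_k)
    hm[d2 > radius ** 2] = 0.0                 # h_k(x_i) = 0 when ||x_i - l_k|| > R
    return hm

def predict_from_heatmap(heatmap, stride=4):
    """Step S3: take the maximum of the thermodynamic diagram and map it back
    to input-image coordinates by the size ratio (here 4)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return x * stride, y * stride
```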
S4: correcting the predicted key points and determining the final positions of the key points;
specifically, in the embodiment of the present application, there is a precision loss when the key point is mapped from the low-resolution image to the high-resolution image, as shown in fig. 4(b), each grid represents the position of one pixel, and the area enclosed by the rectangular frame in fig. 4(a) is intended to thermally predict the position of the left wrist, but when the predicted coordinates are mapped to the resolution of the input image, a large precision loss occurs. As can be seen from fig. 4(b), one pixel in the thermodynamic diagram actually represents the position of 16 pixels of the original image, because the width and the height are both one fourth of the original image, and each time the coordinate product 4 on the thermodynamic diagram can only be mapped to the first pixel of the corresponding area of the input image, that is, the position of the upper left corner of the 16 grids in fig. 4(b), which is the root of the precision loss in the coordinate mapping process. Many efforts have been made to reduce the loss of accuracy in coordinate mapping by manually shifting the thermodynamic predicted keypoint locations by a quarter of a pixel at this stage, i.e., by a distance of 1 pixel on the original input image, which does reduce the expected error between the mapped keypoint and the true keypoint, resulting in a slight improvement in accuracy, but does not solve the problem of loss of accuracy at the root.
Based on this situation, in addition to outputting the thermodynamic diagrams, the network of the present application predicts for each position x_i a two-dimensional offset vector o_k(x_i) with respect to the input image, letting the neural network actively learn the offset between the mapped key point and the true key point. Here o_k(x_i) represents the offset, after mapping, of a position x_i on the k-th thermodynamic diagram with respect to the k-th key point on the input image, and its purpose is to correct the predicted position of the key point. Since there are k key points, the network of the present application generates k such offset fields; for each key point and the positions near it, this amounts to solving a two-dimensional regression problem.
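For illustration, a sketch of how the two-dimensional offset training target o_k(x_i) described above could be constructed for one key point; the stride and radius values and the function name are assumptions:

```python
import numpy as np

def offset_targets(h, w, keypoint_xy, stride=4, radius=3.0):
    """Two-dimensional offset training target o_k(x_i) for one key point l_k = keypoint_xy
    (given in input-image coordinates). For every heatmap position x_i within radius R of
    the key point, the target is the displacement from the mapped position back to l_k."""
    ys, xs = np.mgrid[0:h, 0:w]
    dx = keypoint_xy[0] - xs * stride                      # x-offset on the input image
    dy = keypoint_xy[1] - ys * stride                      # y-offset on the input image
    mask = ((xs - keypoint_xy[0] / stride) ** 2 +
            (ys - keypoint_xy[1] / stride) ** 2) <= radius ** 2
    return np.stack([dx, dy]), mask                        # offsets (2, h, w) and the supervised region
```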
Referring to fig. 2, the correction step is implemented as follows: first, after the network generates the thermodynamic diagrams and the offset vectors, the position of the maximum value of each thermodynamic diagram is taken; second, the offset vector is added to that maximum position to obtain the key point position finally mapped onto the input image.
S5: and determining the attitude information of the target to be recognized according to the key points.
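Putting steps S1-S5 together, the following is a hypothetical end-to-end inference sketch (PyTorch-style, with an assumed model interface returning thermodynamic diagrams and offsets, and an assumed interleaved dx/dy channel order); it is not the patented implementation:

```python
import torch

def recognize_pose(image, model, stride=4):
    """Hypothetical end-to-end sketch of steps S1-S5 (not the patented code).
    image: float tensor (3, H, W) already cropped to the target person;
    model: returns thermodynamic diagrams (1, n, H/stride, W/stride) and
           offsets (1, 2n, H/stride, W/stride) with interleaved dx/dy channels."""
    with torch.no_grad():
        heatmaps, offsets = model(image.unsqueeze(0))       # S2: feature extraction + decoding
    heatmaps, offsets = heatmaps[0], offsets[0]
    n, h, w = heatmaps.shape
    keypoints = []
    for k in range(n):
        y, x = divmod(torch.argmax(heatmaps[k]).item(), w)  # S3: maximum of the k-th heatmap
        dx = offsets[2 * k, y, x].item()                    # S4: learned offset correction
        dy = offsets[2 * k + 1, y, x].item()
        keypoints.append((x * stride + dx, y * stride + dy))
    return keypoints                                        # S5: the pose is the set of key points
```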
In addition, the embodiment of the application also provides steps of model training and testing, specifically:
the thermodynamic diagram is trained by adopting a classical mean square error loss function, and it is noted that the loss is calculated only in the probability value of the region within a distance R near the key point, that is, only those points near the key point are trained, so that the convergence of the network is facilitated, and the loss function is as shown.
L_h(θ) = Σ_k Σ_{x_i : ||x_i − l_k|| ≤ R} ( ĥ_k(x_i) − h_k(x_i) )²   (1), where ĥ_k(x_i) is the probability predicted by the network and h_k(x_i) is the Gaussian target defined above.
For training the offset vectors, inspired by the regression of detection-box coordinates in the object detection domain, the present application employs a smooth L1 loss function to penalize the gap between the true offset and the predicted offset, as shown in Equation (2).
L_o(θ) = Σ_k Σ_{x_i : ||x_i − l_k|| ≤ R} smooth_L1( ô_k(x_i) − o_k(x_i) )   (2), where ô_k(x_i) is the predicted offset and o_k(x_i) is the true offset.
This loss function makes the training more robust to abnormal outliers, thereby letting the gradient propagate backwards through the network more stably. Likewise, the present application only computes this loss at positions that are no more than R away from the key point. After fusing the two losses, the final loss function is shown in Equation (3), where λ_h and λ_o are the weights of the two losses, their ratio is 4:1, and the optimizer used for training the model is Adam.
L(θ) = λ_h·L_h(θ) + λ_o·L_o(θ)   (3)
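A sketch of how Equations (1)-(3) could be computed in PyTorch, assuming a mask that marks the positions within radius R of each key point and the interleaved offset channel order used above; the function name and tensor layout are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_hm, gt_hm, pred_off, gt_off, mask, lambda_h=4.0, lambda_o=1.0):
    """Sketch of Equations (1)-(3). `mask` is (B, n, h, w) and is 1 only at positions
    within radius R of a key point; lambda_h : lambda_o = 4 : 1 as stated above."""
    m = mask.float()
    l_h = (((pred_hm - gt_hm) ** 2) * m).sum() / m.sum().clamp(min=1)             # Eq. (1)
    m2 = m.repeat_interleave(2, dim=1)                                             # match the 2n offset channels
    l_o = ((F.smooth_l1_loss(pred_off, gt_off, reduction='none') * m2).sum()
           / m2.sum().clamp(min=1))                                                # Eq. (2)
    return lambda_h * l_h + lambda_o * l_o                                         # Eq. (3)

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, as stated in the description
```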
In addition, the present application selects three test data sets disclosed in the field of attitude estimation to perform experimental measurements, to further illustrate the advantages of the present application over the prior art:
the operation environment of the embodiment of the application is as follows: 6 cores, Intel Xeon E5-2620 processor, 64GB memory, Titan X display card, Ubuntu 16.04 operating system.
The three data sets are: (1) MSCOCO: the MSCOCO data set can be applied to tasks such as object detection, semantic segmentation and key point detection. This patent mainly uses the 2017 COCO data set, in which the training set contains 118287 pictures and the test set contains 5000 pictures, many of which carry annotations of multiple people.
(2) MPII: the MPII human pose data set is the standard benchmark for evaluating articulated human pose estimation. The data set includes approximately 25K images containing over 40K people with annotated body joints. The images were collected according to a taxonomy of everyday human activities; the whole data set covers 410 human activities, and each image carries an activity label. Each image was extracted from a YouTube video. The data contains approximately 25000 pictures with over 40000 annotated person instances, of which 28000 are used for network training and the remaining 12000 samples are used for testing.
(3) CrowdPose: we also evaluate our approach on the CrowdPose data set, which contains 20,000 pictures and 80,000 human instances. The CrowdPose data set is designed to improve performance in crowded situations, making the model suitable for different scenarios.
In order to evaluate the effectiveness of the algorithm, the experiments in this embodiment use the AP and PCK performance metrics: AP is used as the evaluation metric on the COCO and CROWDPOSE data sets, and PCK is used as the evaluation metric on the MPII data set. Object Keypoint Similarity (OKS), which measures the similarity between predicted key points and annotated key points, is formulated as follows:
OKS = [ Σ_i exp(−D_i² / (2·s²·k_i²))·δ(v_i > 0) ] / [ Σ_i δ(v_i > 0) ]
where D_i represents the Euclidean distance between the i-th predicted key point and the corresponding labeled key point, s is the scale of the object, k_i is a per-keypoint constant controlling the attenuation, and v_i indicates whether the key point is visible. Given an OKS threshold T, the average precision over the test set can then be calculated from the following equation:
AP = [ Σ_p δ(OKS_p > T) ] / [ Σ_p 1 ], where p ranges over the predicted person instances in the test set.
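For illustration, a small NumPy sketch of the OKS and single-threshold AP computations defined above (array shapes and function names are assumptions):

```python
import numpy as np

def oks(pred, gt, visible, s, k):
    """Object Keypoint Similarity for one person.
    pred, gt: (K, 2) key point arrays; visible: (K,) booleans; s: object scale;
    k: (K,) per-keypoint attenuation constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = np.exp(-d2 / (2.0 * s ** 2 * k ** 2))
    return float(e[visible].mean()) if visible.any() else 0.0

def average_precision(oks_values, threshold):
    """AP at one OKS threshold: the fraction of predictions whose OKS exceeds it."""
    return float((np.asarray(oks_values) > threshold).mean())
```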
another important criterion for keypoints is PCK, which indicates the proportion of all predicted keypoints that fall within a certain standardized distance around the corresponding labeled keypoint. This normalized distance is often related to the longest distance of the human torso in the picture. Generally, the normalized distance is represented as PCK @ σ, where σ is a decimal between intervals [0,1], and the normalized distance in the evaluation index is obtained by multiplying σ by the longest trunk distance, and the specific calculation method is as follows:
PCK_k@σ = (1/N)·Σ_{i=1}^{N} δ( d_k^i / D^i ≤ σ )
where N represents the total number of samples, k indexes the k-th key point, d_k^i is the distance between the predicted and labeled k-th key point of sample i, and D^i is the normalization distance of sample i; the overall PCK is therefore:
PCK@σ = (1/K)·Σ_{k=1}^{K} PCK_k@σ, where K is the total number of key points.
the evaluation index used on the MPII dataset is PCKh, and unlike PCK, it replaces the longest torso distance used in normalizing the distance with the longest head distance.
In the embodiment of the application, the AP and PCK values are compared with those of other algorithms on the three data sets MSCOCO, MPII and CROWDPOSE. These methods include the simple baseline for human pose estimation and tracking (SB), towards accurate multi-person pose estimation in the wild (G-RMI), the cascaded pyramid network for multi-person pose estimation (CPN), the stacked hourglass network for human pose estimation (Hourglass), the feature pyramid network for heatmap regression (FPN), and a quantized densely connected network for human pose estimation. The algorithm of the present application is abbreviated as HOPE.
FIG. 5 shows the results of the present application and other algorithms on the MSCOCO data set; FIG. 6 shows the results on the MPII data set; and FIG. 7 shows the results on the CROWDPOSE data set.
As can be seen from fig. 5, 6 and 7, the AP and PCK values of the present application are superior to those of the other algorithms in both sparse and crowded scenes. In addition, fig. 8 shows the detection results of HOPE on the MSCOCO data set, fig. 9 shows the detection results of HOPE on the MPII data set, and fig. 10 shows the detection results of HOPE on the CROWDPOSE data set. As can be seen from fig. 8 to fig. 10, the key points returned by the present application match the persons in the images closely, which further illustrates that the present application achieves a good effect in key point detection.
The embodiment of the invention also provides a posture identification system based on thermodynamic diagrams and offset vectors, which comprises the following steps:
the acquisition module is used for acquiring a target image to be identified;
the characteristic extraction module is used for extracting the characteristics of the target image to be identified;
the key point prediction module is used for predicting the position of a key point according to the extracted features;
the key point correction module is used for correcting the predicted key points and determining the final positions of the key points; and
and the gesture determining module is used for determining gesture information of the target to be recognized according to the key points.
The embodiment of the invention also provides a posture identification system based on thermodynamic diagrams and offset vectors, which comprises the following steps:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method.
Embodiments of the present invention also provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The gesture recognition method based on the thermodynamic diagram and the offset vector is characterized by comprising the following steps:
acquiring a target image to be identified;
extracting the characteristics of the target image to be recognized;
predicting the positions of key points according to the extracted features;
correcting the predicted key points and determining the final positions of the key points; and
and determining the attitude information of the target to be recognized according to the key points.
2. The method for gesture recognition based on thermodynamic diagrams and offset vectors according to claim 1, wherein the step of feature extraction of the target image to be recognized comprises:
cutting the obtained target image to be recognized;
inputting each image obtained by cutting into a residual error network; and
and carrying out coding processing through the residual error network to obtain a first characteristic diagram.
3. The thermodynamic diagram and offset vector based pose recognition method of claim 2, wherein the residual network comprises five convolutional layers;
in addition, the step of obtaining the feature map by performing the encoding process through the residual error network includes the steps of:
carrying out variable-dimension processing on each channel of the feature map through convolution kernel, wherein the variable-dimension processing comprises ascending and descending dimensions;
carrying out normalization processing on each channel; and
and carrying out nonlinear activation processing on the result after the normalization processing.
4. The thermodynamic diagram and offset vector-based pose recognition method according to claim 2, wherein the step of performing feature extraction on the target image to be recognized further comprises a decoding step, and the decoding step comprises:
inputting the obtained first feature map into a deconvolution structure;
decoding the first feature map by a deconvolution structure; and
and acquiring a characteristic response graph of each channel.
5. The thermodynamic diagram and offset vector based gesture recognition method of claim 4, wherein the predicting keypoint locations from the extracted features comprises:
acquiring thermodynamic diagrams from output results of the channels;
calculating the maximum value of each thermodynamic diagram to obtain the position information of each key point on the thermodynamic diagram; and
and mapping the position information of the key points to the target image to be recognized according to the size relation between the target image to be recognized and the thermodynamic diagram.
6. The thermodynamic diagram and offset vector based pose recognition method according to claim 4, wherein the step of modifying the predicted keypoints and determining the final positions of the keypoints comprises the steps of:
determining the offset vector of the key point according to the output result of each channel; and
and adding the offset vector to the maximum value of the thermodynamic diagram according to the offset vector to determine the final position of the key point.
7. The thermodynamic diagram and offset vector based gesture recognition method of claim 6, further comprising the steps of:
training a thermodynamic diagram by adopting a mean square error loss function; and
in training the offset vector, a smooth penalty function is used to handle the gap between the true offset and the predicted offset.
8. A system for gesture recognition based on thermodynamic diagrams and offset vectors, comprising:
the acquisition module is used for acquiring a target image to be identified;
the characteristic extraction module is used for extracting the characteristics of the target image to be identified;
the key point prediction module is used for predicting the position of a key point according to the extracted features;
the key point correction module is used for correcting the predicted key points and determining the final positions of the key points; and
and the gesture determining module is used for determining gesture information of the target to be recognized according to the key points.
9. A system for gesture recognition based on thermodynamic diagrams and offset vectors, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are for performing the method of any one of claims 1-7.
CN202010006031.3A 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector Active CN111191622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010006031.3A CN111191622B (en) 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010006031.3A CN111191622B (en) 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector

Publications (2)

Publication Number Publication Date
CN111191622A true CN111191622A (en) 2020-05-22
CN111191622B CN111191622B (en) 2023-05-26

Family

ID=70708632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010006031.3A Active CN111191622B (en) 2020-01-03 2020-01-03 Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector

Country Status (1)

Country Link
CN (1) CN111191622B (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680623A (en) * 2020-06-05 2020-09-18 北京百度网讯科技有限公司 Attitude conversion method and apparatus, electronic device, and storage medium
CN111695519A (en) * 2020-06-12 2020-09-22 北京百度网讯科技有限公司 Key point positioning method, device, equipment and storage medium
CN111814593A (en) * 2020-06-19 2020-10-23 浙江大华技术股份有限公司 Traffic scene analysis method and device, and storage medium
CN111860276A (en) * 2020-07-14 2020-10-30 咪咕文化科技有限公司 Human body key point detection method, device, network equipment and storage medium
CN111860300A (en) * 2020-07-17 2020-10-30 广州视源电子科技股份有限公司 Key point detection method and device, terminal equipment and storage medium
CN111914639A (en) * 2020-06-30 2020-11-10 吴�荣 Driving action recognition method of lightweight convolution space-time simple cycle unit model
CN111967406A (en) * 2020-08-20 2020-11-20 高新兴科技集团股份有限公司 Method, system, equipment and storage medium for generating human body key point detection model
CN111985556A (en) * 2020-08-19 2020-11-24 南京地平线机器人技术有限公司 Key point identification model generation method and key point identification method
CN112101490A (en) * 2020-11-20 2020-12-18 支付宝(杭州)信息技术有限公司 Thermodynamic diagram conversion model training method and device
CN112132131A (en) * 2020-09-22 2020-12-25 深兰科技(上海)有限公司 Measuring cylinder liquid level identification method and device
CN112417972A (en) * 2020-10-23 2021-02-26 奥比中光科技集团股份有限公司 Heat map decoding method, human body joint point estimation method and system
CN112446302A (en) * 2020-11-05 2021-03-05 杭州易现先进科技有限公司 Human body posture detection method and system, electronic equipment and storage medium
CN112528858A (en) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 Training method, device, equipment, medium and product of human body posture estimation model
CN112597955A (en) * 2020-12-30 2021-04-02 华侨大学 Single-stage multi-person attitude estimation method based on feature pyramid network
CN112651316A (en) * 2020-12-18 2021-04-13 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN112837336A (en) * 2021-02-23 2021-05-25 浙大宁波理工学院 Method and system for estimating and acquiring room layout based on heat map correction of key points
CN112862920A (en) * 2021-02-18 2021-05-28 清华大学 Human body image generation method and system based on hand-drawn sketch
CN112926648A (en) * 2021-02-24 2021-06-08 北京优创新港科技股份有限公司 Method and device for detecting abnormality of tobacco leaf tip in tobacco leaf baking process
CN113011402A (en) * 2021-04-30 2021-06-22 中国科学院自动化研究所 System and method for estimating postures of primates based on convolutional neural network
CN113076891A (en) * 2021-04-09 2021-07-06 华南理工大学 Human body posture prediction method and system based on improved high-resolution network
CN113128436A (en) * 2021-04-27 2021-07-16 北京百度网讯科技有限公司 Method and device for detecting key points
CN113159198A (en) * 2021-04-27 2021-07-23 上海芯物科技有限公司 Target detection method, device, equipment and storage medium
CN113343762A (en) * 2021-05-07 2021-09-03 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113537234A (en) * 2021-06-10 2021-10-22 浙江大华技术股份有限公司 Quantity counting method and device, electronic device and computer equipment
CN114359974A (en) * 2022-03-08 2022-04-15 广东履安实业有限公司 Human body posture detection method and device and storage medium
CN114463534A (en) * 2021-12-28 2022-05-10 佳都科技集团股份有限公司 Target key point detection method, device, equipment and storage medium
CN114863237A (en) * 2022-03-25 2022-08-05 中国人民解放军国防科技大学 Method and system for recognizing swimming postures
CN115272992A (en) * 2022-09-30 2022-11-01 松立控股集团股份有限公司 Vehicle attitude estimation method
CN115331153A (en) * 2022-10-12 2022-11-11 山东省第二人民医院(山东省耳鼻喉医院、山东省耳鼻喉研究所) Posture monitoring method for assisting vestibule rehabilitation training
CN116645699A (en) * 2023-07-27 2023-08-25 杭州华橙软件技术有限公司 Key point detection method, device, terminal and computer readable storage medium
CN117437433A (en) * 2023-12-07 2024-01-23 苏州铸正机器人有限公司 Sub-pixel level key point detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033946A (en) * 2018-06-08 2018-12-18 东南大学 Merge the estimation method of human posture of directional diagram
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033946A (en) * 2018-06-08 2018-12-18 东南大学 Merge the estimation method of human posture of directional diagram
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘唐波; 杨锐; 王文伟; 何楚: "Research on driver hand movement detection method based on pose estimation" *

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680623B (en) * 2020-06-05 2023-04-21 北京百度网讯科技有限公司 Gesture conversion method and device, electronic equipment and storage medium
CN111680623A (en) * 2020-06-05 2020-09-18 北京百度网讯科技有限公司 Attitude conversion method and apparatus, electronic device, and storage medium
CN111695519A (en) * 2020-06-12 2020-09-22 北京百度网讯科技有限公司 Key point positioning method, device, equipment and storage medium
EP3869402A1 (en) * 2020-06-12 2021-08-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, storage medium and computer program product
JP2021197157A (en) * 2020-06-12 2021-12-27 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Key point specification method, device, apparatus, and storage media
CN111695519B (en) * 2020-06-12 2023-08-08 北京百度网讯科技有限公司 Method, device, equipment and storage medium for positioning key point
US11610389B2 (en) 2020-06-12 2023-03-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for positioning key point, device, and storage medium
JP7194215B2 (en) 2020-06-12 2022-12-21 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド KEYPOINT IDENTIFICATION METHOD AND DEVICE, DEVICE, STORAGE MEDIUM
CN111814593A (en) * 2020-06-19 2020-10-23 浙江大华技术股份有限公司 Traffic scene analysis method and device, and storage medium
CN111914639A (en) * 2020-06-30 2020-11-10 吴�荣 Driving action recognition method of lightweight convolution space-time simple cycle unit model
CN111860276A (en) * 2020-07-14 2020-10-30 咪咕文化科技有限公司 Human body key point detection method, device, network equipment and storage medium
CN111860276B (en) * 2020-07-14 2023-04-11 咪咕文化科技有限公司 Human body key point detection method, device, network equipment and storage medium
CN111860300A (en) * 2020-07-17 2020-10-30 广州视源电子科技股份有限公司 Key point detection method and device, terminal equipment and storage medium
CN111985556A (en) * 2020-08-19 2020-11-24 南京地平线机器人技术有限公司 Key point identification model generation method and key point identification method
CN111967406A (en) * 2020-08-20 2020-11-20 高新兴科技集团股份有限公司 Method, system, equipment and storage medium for generating human body key point detection model
CN112132131A (en) * 2020-09-22 2020-12-25 深兰科技(上海)有限公司 Measuring cylinder liquid level identification method and device
CN112132131B (en) * 2020-09-22 2024-05-03 深兰科技(上海)有限公司 Measuring cylinder liquid level identification method and device
CN112417972A (en) * 2020-10-23 2021-02-26 奥比中光科技集团股份有限公司 Heat map decoding method, human body joint point estimation method and system
CN112446302A (en) * 2020-11-05 2021-03-05 杭州易现先进科技有限公司 Human body posture detection method and system, electronic equipment and storage medium
CN112446302B (en) * 2020-11-05 2023-09-19 杭州易现先进科技有限公司 Human body posture detection method, system, electronic equipment and storage medium
CN112101490A (en) * 2020-11-20 2020-12-18 支付宝(杭州)信息技术有限公司 Thermodynamic diagram conversion model training method and device
CN112101490B (en) * 2020-11-20 2021-03-02 支付宝(杭州)信息技术有限公司 Thermodynamic diagram conversion model training method and device
CN112528858A (en) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 Training method, device, equipment, medium and product of human body posture estimation model
CN112651316A (en) * 2020-12-18 2021-04-13 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN112597955B (en) * 2020-12-30 2023-06-02 华侨大学 Single-stage multi-person gesture estimation method based on feature pyramid network
CN112597955A (en) * 2020-12-30 2021-04-02 华侨大学 Single-stage multi-person attitude estimation method based on feature pyramid network
CN112862920A (en) * 2021-02-18 2021-05-28 清华大学 Human body image generation method and system based on hand-drawn sketch
CN112837336A (en) * 2021-02-23 2021-05-25 浙大宁波理工学院 Method and system for estimating and acquiring room layout based on heat map correction of key points
CN112837336B (en) * 2021-02-23 2022-02-22 浙大宁波理工学院 Method and system for estimating and acquiring room layout based on heat map correction of key points
CN112926648B (en) * 2021-02-24 2021-11-16 北京优创新港科技股份有限公司 Method and device for detecting abnormality of tobacco leaf tip in tobacco leaf baking process
CN112926648A (en) * 2021-02-24 2021-06-08 北京优创新港科技股份有限公司 Method and device for detecting abnormality of tobacco leaf tip in tobacco leaf baking process
CN113076891B (en) * 2021-04-09 2023-08-22 华南理工大学 Human body posture prediction method and system based on improved high-resolution network
CN113076891A (en) * 2021-04-09 2021-07-06 华南理工大学 Human body posture prediction method and system based on improved high-resolution network
CN113128436B (en) * 2021-04-27 2022-04-01 北京百度网讯科技有限公司 Method and device for detecting key points
CN113159198A (en) * 2021-04-27 2021-07-23 上海芯物科技有限公司 Target detection method, device, equipment and storage medium
CN113128436A (en) * 2021-04-27 2021-07-16 北京百度网讯科技有限公司 Method and device for detecting key points
CN113011402A (en) * 2021-04-30 2021-06-22 中国科学院自动化研究所 System and method for estimating postures of primates based on convolutional neural network
CN113343762A (en) * 2021-05-07 2021-09-03 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113343762B (en) * 2021-05-07 2022-03-29 北京邮电大学 Human body posture estimation grouping model training method, posture estimation method and device
CN113537234A (en) * 2021-06-10 2021-10-22 浙江大华技术股份有限公司 Quantity counting method and device, electronic device and computer equipment
CN114463534A (en) * 2021-12-28 2022-05-10 佳都科技集团股份有限公司 Target key point detection method, device, equipment and storage medium
CN114359974A (en) * 2022-03-08 2022-04-15 广东履安实业有限公司 Human body posture detection method and device and storage medium
CN114359974B (en) * 2022-03-08 2022-06-07 广东履安实业有限公司 Human body posture detection method and device and storage medium
CN114863237B (en) * 2022-03-25 2023-07-14 中国人民解放军国防科技大学 Method and system for recognizing swimming gesture
CN114863237A (en) * 2022-03-25 2022-08-05 中国人民解放军国防科技大学 Method and system for recognizing swimming postures
CN115272992A (en) * 2022-09-30 2022-11-01 松立控股集团股份有限公司 Vehicle attitude estimation method
CN115331153A (en) * 2022-10-12 2022-11-11 山东省第二人民医院(山东省耳鼻喉医院、山东省耳鼻喉研究所) Posture monitoring method for assisting vestibule rehabilitation training
CN116645699A (en) * 2023-07-27 2023-08-25 杭州华橙软件技术有限公司 Key point detection method, device, terminal and computer readable storage medium
CN116645699B (en) * 2023-07-27 2023-09-29 杭州华橙软件技术有限公司 Key point detection method, device, terminal and computer readable storage medium
CN117437433A (en) * 2023-12-07 2024-01-23 苏州铸正机器人有限公司 Sub-pixel level key point detection method and device
CN117437433B (en) * 2023-12-07 2024-03-19 苏州铸正机器人有限公司 Sub-pixel level key point detection method and device

Also Published As

Publication number Publication date
CN111191622B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN111191622B (en) Gesture recognition method, system and storage medium based on thermodynamic diagram and offset vector
CN112597941B (en) Face recognition method and device and electronic equipment
Sapp et al. Parsing human motion with stretchable models
JP6639123B2 (en) Image processing apparatus, image processing method, and program
Yan et al. Crowd counting via perspective-guided fractional-dilation convolution
CN107886069A (en) A kind of multiple target human body 2D gesture real-time detection systems and detection method
Zhu et al. Convolutional relation network for skeleton-based action recognition
Ma et al. Ppt: token-pruned pose transformer for monocular and multi-view human pose estimation
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN112784810B (en) Gesture recognition method, gesture recognition device, computer equipment and storage medium
CN111539941B (en) Parkinson's disease leg flexibility task evaluation method and system, storage medium and terminal
CN110874865A (en) Three-dimensional skeleton generation method and computer equipment
CN112257526A (en) Action identification method based on feature interactive learning and terminal equipment
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN117238034A (en) Human body posture estimation method based on space-time transducer
Nguyen et al. Combined YOLOv5 and HRNet for high accuracy 2D keypoint and human pose estimation
CN114612545A (en) Image analysis method and training method, device, equipment and medium of related model
Dong et al. ADORE: An adaptive holons representation framework for human pose estimation
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN112199994B (en) Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time
CN116664677B (en) Sight estimation method based on super-resolution reconstruction
CN116434010A (en) Multi-view pedestrian attribute identification method
CN113343762B (en) Human body posture estimation grouping model training method, posture estimation method and device
CN114841887A (en) Image restoration quality evaluation method based on multi-level difference learning
CN118119971A (en) Electronic device and method for determining height of person using neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant