CN112989947A

CN112989947A - Method and device for estimating three-dimensional coordinates of human body key points

Info

Publication number: CN112989947A
Application number: CN202110183668.4A
Authority: CN
Inventors: 段祎婷; 王蔚; 应兴德; 丁泽震; 聂学成
Original assignee: Shanghai Yitu Network Science and Technology Co Ltd
Current assignee: Shanghai Yitu Network Science and Technology Co Ltd
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2021-06-18
Anticipated expiration: 2041-02-08
Also published as: CN112989947B

Abstract

The application relates to the technical field of computer vision, and in particular to a method and a device for estimating three-dimensional coordinates of human body key points. The method comprises: detecting, from an image to be recognized, thermodynamic diagrams (heat maps) each containing a human body key point, together with the two-dimensional coordinates of each human body key point; for each human body key point, inputting the key point thermodynamic diagram of that key point and the image to be recognized into a trained depth detection model, and determining the depth information of the key point based on key point features derived from the key point thermodynamic diagram and the image to be recognized; and determining the three-dimensional coordinates of each human body key point according to its depth information and corresponding two-dimensional coordinates, or according to its depth information, corresponding two-dimensional coordinates, and preset human body structure information. In this way, the accuracy of detecting the three-dimensional coordinates of human body key points during human body action recognition can be improved.

Description

Method and device for estimating three-dimensional coordinates of human body key points

Technical Field

The application relates to the technical field of computer vision, in particular to a method and a device for estimating three-dimensional coordinates of key points of a human body.

Background

At present, human body action recognition is one of the most challenging research directions in the field of computer vision, and recognition of human body actions needs to be realized according to three-dimensional coordinates of each human body key point contained in a human body, so how to obtain the three-dimensional coordinates of the human body key points becomes a problem to be solved urgently.

In the related art, the three-dimensional coordinates of human body key points may be obtained by determining a neural network model through Neural Architecture Search (NAS) and using that model to convert the two-dimensional coordinates of the key points directly into three-dimensional coordinates. However, such a direct conversion lacks some spatial information, so the depth of a human body key point in three-dimensional space is difficult to predict, which causes depth ambiguity.

Disclosure of Invention

The embodiment of the application provides a method and a device for estimating three-dimensional coordinates of human key points, so as to improve the accuracy of the detection of the three-dimensional coordinates of the human key points.

The embodiment of the application provides the following specific technical scheme:

A method for estimating three-dimensional coordinates of human body key points comprises the following steps:

detecting and obtaining thermodynamic diagrams each containing human body key points and two-dimensional coordinates of each human body key point from an image to be recognized, wherein the image to be recognized contains a human body, and the human body contains each human body key point;

for each human body key point, inputting the thermodynamic diagram of the human body key point and the image to be recognized into a trained depth detection model, determining key point features of the human body key point, and determining depth information of the human body key point based on the key point features, wherein the depth detection model is obtained by iterative training on image samples to be recognized, the corresponding thermodynamic diagram samples, and the real depth information of each human body key point in each image sample to be recognized, and the depth information indicates whether the human body key point lies in front of or behind a preset human body calibration point;

and determining the three-dimensional coordinates of the key points of the human body according to the depth information of the key points of the human body and the corresponding two-dimensional coordinates.

Optionally, detecting and obtaining thermodynamic diagrams each including a human body key point and two-dimensional coordinates of each human body key point from the image to be recognized specifically includes:

based on a trained two-dimensional coordinate recognition model, taking the image to be recognized as an input parameter, converting the image to be recognized into a preset number of key point thermodynamic diagrams through a preset image conversion mode, selecting, in each key point thermodynamic diagram, the pixel point with the maximum heat value among all heat values contained in that diagram as a human body key point of the image to be recognized, and obtaining the two-dimensional coordinates of each human body key point in the image to be recognized.

Optionally, the two-dimensional coordinate recognition model is trained in the following manner:

acquiring a first image sample set, wherein the first image sample set comprises each image sample to be identified and a corresponding sample label, and the sample label represents a real two-dimensional coordinate of each human body key point contained in the image sample to be identified;

for each image sample to be recognized, inputting the image sample into an initial two-dimensional coordinate recognition model, determining a predicted two-dimensional coordinate of each human body key point in the image sample, calculating an error value between each predicted two-dimensional coordinate and the corresponding sample label, and adjusting the parameters of the initial two-dimensional coordinate recognition model until the calculated error value is minimized, thereby obtaining the trained two-dimensional coordinate recognition model.

Optionally, the training mode of the depth detection model is as follows:

for each key point thermodynamic diagram sample, inputting the key point thermodynamic diagram sample and the corresponding image sample to be recognized into an initial depth detection model, determining predicted depth information of the human body key point contained in the key point thermodynamic diagram sample, calculating an error value between the predicted depth information and a depth label, and adjusting the parameters of the initial depth detection model until the calculated error value is minimized, thereby obtaining the trained depth detection model.

Optionally, determining the three-dimensional coordinates of each human body key point according to the depth information of each human body key point and the corresponding two-dimensional coordinates, specifically including:

for each human body key point, based on a trained coordinate conversion model, taking the two-dimensional coordinates and the depth information of the human body key point as input parameters, combining them to generate initial three-dimensional coordinates, and obtaining the three-dimensional coordinates of the human body key point relative to the human body calibration point through regression by the successive fully connected layers, wherein the coordinate conversion model comprises at least a plurality of fully connected layers.
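The conversion described above can be illustrated with a minimal sketch. The application does not specify how the depth information is encoded, which activation is used, or what the layer sizes and weights are, so the following are assumptions made only for illustration: the depth information is encoded as +1 (front) or -1 (rear), a ReLU is applied between layers, and the two hand-written layers stand in for trained parameters. The names `dense`, `relu`, and `convert_to_3d` are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]  # (weights, biases) of one fully connected layer


def dense(x: Sequence[float], layer: Layer) -> List[float]:
    """Apply one fully connected layer: y = W x + b."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x: Sequence[float]) -> List[float]:
    """Assumed activation between layers (not stated in the application)."""
    return [max(0.0, v) for v in x]


def convert_to_3d(xy: Tuple[float, float], depth_sign: float,
                  layers: Sequence[Layer]) -> List[float]:
    """Combine (x, y) and the depth information into an initial 3D vector,
    then regress the final coordinates through the fully connected layers."""
    x = [float(xy[0]), float(xy[1]), float(depth_sign)]  # initial 3D coordinates
    for i, layer in enumerate(layers):
        x = dense(x, layer)
        if i < len(layers) - 1:  # no activation after the last (regression) layer
            x = relu(x)
    return x


# Toy parameters standing in for a trained coordinate conversion model.
toy_layers: List[Layer] = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 1.0]),
    ([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 5.0]], [0.0, 0.0, -2.5]),
]

rear_point = convert_to_3d((12, 30), -1.0, toy_layers)
front_point = convert_to_3d((12, 30), +1.0, toy_layers)
```

With these toy parameters, the same two-dimensional coordinates yield different third components depending on the depth information, which is the role the depth information plays in resolving depth ambiguity.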

Optionally, the training mode of the coordinate transformation model is as follows:

acquiring a second image sample set, wherein the second image sample set at least comprises a plurality of two-dimensional coordinate samples, corresponding depth information samples and coordinate labels;

for each two-dimensional coordinate sample, inputting the two-dimensional coordinate sample and the corresponding depth information sample into an initial coordinate conversion model, determining a predicted three-dimensional coordinate of the human body key point, calculating an error value between the predicted three-dimensional coordinate and the coordinate label, and adjusting the parameters of the initial coordinate conversion model until the calculated error value is minimized, thereby obtaining the trained coordinate conversion model.

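Optionally, determining the three-dimensional coordinates of each human body key point according to the depth information of each human body key point and the corresponding two-dimensional coordinates specifically includes:

for each human body key point, determining the three-dimensional coordinates of the human body key point according to its depth information, its two-dimensional coordinates, and preset coordinate mapping information, wherein the coordinate mapping information represents the mapping relation from the depth information and the two-dimensional coordinates of a human body key point to its three-dimensional coordinates.

Optionally, determining the three-dimensional coordinates of each human body key point according to the depth information of each human body key point and the corresponding two-dimensional coordinates specifically includes:

determining the three-dimensional coordinates of each human body key point according to its depth information, the corresponding two-dimensional coordinates, and preset human body structure information.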

An apparatus for estimating three-dimensional coordinates of key points of a human body, comprising:

a first detection module, configured to detect, from an image to be recognized, thermodynamic diagrams each containing a human body key point and the two-dimensional coordinates of each human body key point, wherein the image to be recognized contains a human body, and the human body contains the human body key points;

a second detection module, configured to, for each human body key point, input the thermodynamic diagram of the human body key point and the image to be recognized into a trained depth detection model, determine key point features of the human body key point, and determine depth information of the human body key point based on the key point features, wherein the depth detection model is obtained by iterative training on image samples to be recognized, the corresponding thermodynamic diagram samples, and the real depth information of each human body key point in each image sample to be recognized, and the depth information indicates whether the human body key point lies in front of or behind a preset human body calibration point;

and a determining module, configured to determine the three-dimensional coordinates of each human body key point according to the depth information of the human body key point and the corresponding two-dimensional coordinates.

Optionally, the first detection module is specifically configured to:

Optionally, for training the two-dimensional coordinate recognition model, the apparatus further includes:

a first acquisition module, configured to acquire a first image sample set, wherein the first image sample set comprises image samples to be recognized and corresponding sample labels, and each sample label represents the real two-dimensional coordinates of each human body key point contained in the corresponding image sample to be recognized;

a first training module, configured to, for each image sample to be recognized, input the image sample into an initial two-dimensional coordinate recognition model, determine the predicted two-dimensional coordinates of each human body key point in the image sample, calculate an error value between each predicted two-dimensional coordinate and the corresponding sample label, and adjust the parameters of the initial two-dimensional coordinate recognition model until the calculated error value is minimized, thereby obtaining the trained two-dimensional coordinate recognition model.

Optionally, for training the depth detection model, the apparatus further includes:

a second training module, configured to, for each key point thermodynamic diagram sample, input the key point thermodynamic diagram sample and the corresponding image sample to be recognized into an initial depth detection model, determine the predicted depth information of the human body key point contained in the key point thermodynamic diagram sample, calculate an error value between the predicted depth information and the depth label, and adjust the parameters of the initial depth detection model until the calculated error value is minimized, thereby obtaining the trained depth detection model.

Optionally, the determining module is specifically configured to:

Optionally, for training the coordinate conversion model, the apparatus further includes:

a second acquisition module, configured to acquire a second image sample set, wherein the second image sample set comprises at least a plurality of two-dimensional coordinate samples, corresponding depth information samples, and coordinate labels;

a third training module, configured to, for each two-dimensional coordinate sample, input the two-dimensional coordinate sample and the corresponding depth information sample into an initial coordinate conversion model, determine the predicted three-dimensional coordinate of the human body key point, calculate an error value between the predicted three-dimensional coordinate and the coordinate label, and adjust the parameters of the initial coordinate conversion model until the calculated error value is minimized, thereby obtaining the trained coordinate conversion model.

Optionally, the determining module is specifically configured to:

An electronic device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the estimation method of the three-dimensional coordinates of the human body key points when executing the program.

A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method for estimating three-dimensional coordinates of human key points.

In the embodiment of the application, key point thermodynamic diagrams each containing a human body key point, together with the two-dimensional coordinates of each human body key point, are detected from an image to be recognized that contains a human body. For each human body key point, the thermodynamic diagram of the key point and the image to be recognized are input into the trained depth detection model, the key point features of the human body key point are determined, and the depth information of the human body key point is determined based on the key point features. The three-dimensional coordinates of each human body key point are then determined according to its depth information and corresponding two-dimensional coordinates. Because the depth information of each human body key point is obtained through the depth detection model and the three-dimensional coordinates are determined from the depth information and the two-dimensional coordinates, the spatial information of the human body key points is taken into account, the depth ambiguity in the three-dimensional coordinates is resolved, and the detection accuracy of the three-dimensional coordinates of the human body key points is improved.

Drawings

FIG. 1 is a schematic diagram of a human body key point estimation algorithm based on a three-dimensional thermodynamic diagram in the related art;

FIG. 2 is a diagram illustrating a keypoint indirect estimation algorithm based on two-dimensional pose estimation in the related art;

FIG. 3 is a flowchart of a human body motion recognition method based on three-dimensional key point coordinates in an embodiment of the present application;

FIG. 4 is a schematic diagram of a model structure in an embodiment of the present application;

FIG. 5 is a schematic diagram of coordinate transformation in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an apparatus for estimating three-dimensional coordinates of key points of a human body according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the related art, methods for obtaining three-dimensional coordinates of human body key points can be divided into two main categories: one is a key point direct estimation algorithm based on a three-dimensional thermodynamic diagram, and the other is a key point indirect estimation algorithm based on two-dimensional pose estimation.

The first method: the method based on the three-dimensional thermodynamic diagram calculates the three-dimensional human body key points directly. FIG. 1 is a schematic diagram of this human body key point estimation algorithm in the related art. The input is a Red Green Blue (RGB) image whose human body key points are to be estimated, and a deep learning model calculates the three-dimensional thermodynamic diagram of the human body, so that the RGB image is mapped directly to the coordinates of the three-dimensional human body key points. This is an end-to-end algorithm, but it has several drawbacks. It has no intermediate supervision process. A model that maps RGB images directly to three-dimensional human body key point coordinates is strongly affected by the background, illumination, and clothing in the picture. The features that a single model must learn are too complex, and the algorithm complexity is too high. In addition, labeled three-dimensional pose data are scarce, so the training set is limited in size and the generalization ability of the model is insufficient.

The second method: the key point indirect estimation algorithm based on two-dimensional pose estimation calculates the three-dimensional key point coordinates indirectly. FIG. 2 is a schematic diagram of this algorithm in the related art. The input is an RGB image whose key points are to be estimated. First, the two-dimensional key point coordinates of the human body are estimated through a deep learning model; then, based on the two-dimensional key point coordinates, the three-dimensional key point coordinates of the human body are estimated directly through a deep learning model determined by Neural Architecture Search (NAS). When two-dimensional coordinates are converted directly into three-dimensional coordinates, some spatial information is lacking and the depth ambiguity is difficult to resolve, so the depth information of the human body key points in three-dimensional space is difficult to predict and the detection accuracy of the three-dimensional coordinates of the human body key points is reduced.

In the embodiment of the application, thermodynamic diagrams each containing a human body key point, together with the two-dimensional coordinates of each human body key point, are detected from the image to be recognized. For each human body key point, the thermodynamic diagram of the key point and the image to be recognized are input into the trained depth detection model to determine the depth information of the key point. The three-dimensional coordinates of each human body key point are then determined according to its depth information and corresponding two-dimensional coordinates, and the human body in the image to be recognized is recognized according to the determined three-dimensional coordinates. Because the depth information of each human body key point is obtained through the depth detection model and the three-dimensional coordinates are determined from the depth information and the corresponding two-dimensional coordinates, the depth ambiguity can be resolved and the accuracy of three-dimensional coordinate recognition is improved. In addition, because the depth information is extracted using the original image to be recognized, richer depth information can be obtained, which further improves the accuracy of three-dimensional coordinate recognition.

Based on the above embodiment, FIG. 3 shows a flowchart of the method for estimating three-dimensional coordinates of human body key points in the embodiment of the present application, which specifically includes:

Step 300: detecting, from the image to be recognized, thermodynamic diagrams each containing a human body key point and the two-dimensional coordinates of each human body key point.

The image to be recognized contains a human body, and the human body contains the human body key points.

In the embodiment of the present application, when step 300 is executed, specifically, the method includes:

based on a trained two-dimensional coordinate recognition model, taking the image to be recognized as an input parameter, converting the image to be recognized into a preset number of key point thermodynamic diagrams through a preset image conversion mode, selecting, in each key point thermodynamic diagram, the pixel point with the maximum heat value among all heat values contained in that diagram as a human body key point of the image to be recognized, and obtaining the two-dimensional coordinates of each human body key point in the image to be recognized.

In the embodiment of the present application, the image to be recognized is input into the trained two-dimensional coordinate recognition model, which performs the processing described below and finally outputs each human body key point contained in the image to be recognized and the two-dimensional coordinate corresponding to each human body key point. The processing procedure of the two-dimensional coordinate recognition model is elaborated as follows:

First, the image to be recognized is converted into a preset number of key point thermodynamic diagrams through a preset image conversion mode.

Because the image to be recognized contains a plurality of human body key points, it is converted into a plurality of key point thermodynamic diagrams. Each key point thermodynamic diagram has the same image size as the image to be recognized, so the two contain the same number of pixel points.

The two-dimensional coordinate recognition model may be, for example, a Convolutional Neural Network (CNN), which is not limited in the embodiment of the present application.

Then, since a key point thermodynamic diagram is an image generated from the heat values of its pixel points, after the image to be recognized is converted into the key point thermodynamic diagrams, the heat value corresponding to each pixel point contained in each key point thermodynamic diagram is obtained.

Each key point thermodynamic diagram comprises pixel points, and each pixel point corresponds to one heat value, which is a numerical value.

Then, for each key point thermodynamic diagram, after the heat values corresponding to its pixel points are obtained, the pixel point corresponding to the maximum heat value is selected and taken as the human body key point contained in that key point thermodynamic diagram.

Finally, since a human body key point is a pixel point in the image to be recognized, and the position of every pixel point in the image to be recognized is known, the two-dimensional coordinates of the human body key point in the image to be recognized can be determined from the position of that pixel point once the human body key point has been determined.
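The selection of the maximum heat value described above can be sketched as follows. This is a minimal illustration, not the recognition model itself: each key point thermodynamic diagram (heat map) is assumed to be a plain nested list of heat values with the same height and width as the image, and the function names `keypoint_from_heatmap` and `keypoints_2d` as well as the toy values are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows of heat values, one value per pixel point


def keypoint_from_heatmap(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, heat) of the pixel point with the maximum heat value."""
    best_x, best_y, best_heat = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, heat in enumerate(row):
            if heat > best_heat:  # the first maximum wins on ties
                best_x, best_y, best_heat = x, y, heat
    return best_x, best_y, best_heat


def keypoints_2d(heatmaps: List[Heatmap]) -> List[Tuple[int, int]]:
    """One diagram per human body key point -> one (x, y) coordinate per key point."""
    return [keypoint_from_heatmap(h)[:2] for h in heatmaps]


# Two toy 3x4 diagrams standing in for the output of the recognition model.
toy_heatmaps = [
    [[0.0, 0.1, 0.0, 0.0],
     [0.1, 0.9, 0.2, 0.0],
     [0.0, 0.1, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.1, 0.3],
     [0.0, 0.0, 0.2, 0.8]],
]
coords = keypoints_2d(toy_heatmaps)
```

Because each diagram has the same size as the image to be recognized, the returned pixel position can be used directly as the two-dimensional coordinate of the key point.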

For example, the number of the human body key points may be 17, which are respectively a vertex key point, a nose key point, a neck key point, a chest key point, a hip key point, a left shoulder key point, a right shoulder key point, a left elbow key point, a right elbow key point, a left hip key point, a right hip key point, a left knee key point, a right knee key point, a left ankle key point, and a right ankle key point, which is not limited in the embodiment of the present application.

For example, assuming that each human body has 17 human body key points, the image to be recognized is partitioned to obtain 17 region images, which are respectively input to the trained two-dimensional coordinate recognition model to obtain 17 key point thermodynamic diagrams, each representing one human body key point. The two-dimensional coordinate of each human body key point in the image to be recognized is then obtained by locating the maximum heat value in the corresponding key point thermodynamic diagram.

It should be noted that the image to be recognized in the embodiment of the present application may contain only a human body, or may contain a human body and other objects. In the latter case, human body detection is first performed on the image, the human body is marked with a bounding rectangle, and the marked region is cropped out to obtain an image containing only the human body. The cropped image is then used as the image to be recognized that is to be partitioned.
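The cropping step can be sketched as follows, assuming the bounding rectangle has already been produced by some human body detector (the detector itself is outside this sketch). The image representation, the `(left, top, right, bottom)` box convention with exclusive right and bottom edges, and the name `crop_to_box` are assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # one RGB pixel point
Image = List[List[Pixel]]             # rows of pixel points
Box = Tuple[int, int, int, int]       # (left, top, right, bottom), right/bottom exclusive


def crop_to_box(image: Image, box: Box) -> Image:
    """Return the part of the image inside the bounding rectangle."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    # Clamp the rectangle to the image so a slightly oversized box still works.
    left, top = max(0, left), max(0, top)
    right, bottom = min(width, right), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("bounding rectangle does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


# Toy 6x4 image whose pixel values record their own (x, y) position.
image = [[(x, y, 0) for x in range(6)] for y in range(4)]
human_box = (2, 1, 5, 3)  # hypothetical detector output
cropped = crop_to_box(image, human_box)
```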

The image to be recognized may be an RGB image, for example.

Before the human body key points contained in an image to be recognized are recognized through the two-dimensional coordinate recognition model, the model first needs to be trained. The training mode of the two-dimensional coordinate recognition model in the embodiment of the application is elaborated below, and specifically includes:

S1: A first image sample set is acquired.

The first image sample set comprises region image samples and corresponding sample labels, and each sample label represents the real two-dimensional coordinates of the human body key point contained in the corresponding region image sample.

In an embodiment of the present application, a first image sample set is obtained, where the first image sample set includes a plurality of sample pairs, and each sample pair includes a region image sample, a thermodynamic diagram corresponding to the region image sample, and a sample label corresponding to the region image sample.

It should be noted that the sample label is the real two-dimensional coordinate of the human body key point included in the region image sample, and the sample label is labeled in advance.

S2: For each region image sample, the region image sample is input into the initial two-dimensional coordinate recognition model, the predicted two-dimensional coordinates of the human body key point in the region image sample are determined, an error value between the predicted two-dimensional coordinates and the corresponding sample label is calculated, and the parameters of the initial two-dimensional coordinate recognition model are adjusted until the calculated error value is minimized, thereby obtaining the trained two-dimensional coordinate recognition model.

In the embodiment of the present application, each region image sample in the first image sample set is input into the initial two-dimensional coordinate recognition model, and the following processing is performed:

First, the region image sample is subjected to Gaussian transformation processing and converted into a key point thermodynamic diagram.

Then, because the key point thermodynamic diagram is an image generated from the heat values of its pixel points, after the conversion the heat values corresponding to the pixel points included in the key point thermodynamic diagram are obtained, the pixel point corresponding to the maximum heat value is selected as the final predicted human body key point, and the predicted two-dimensional coordinates of the human body key point in the region image sample are obtained.

And finally, calculating an error value between the predicted two-dimensional coordinate and the corresponding sample label, and adjusting each parameter of the initial two-dimensional coordinate identification model based on the calculated error value.

In this way, each region image sample in the first image sample set is input to the initial two-dimensional coordinate recognition model for iterative training until the calculated error value is minimized, and the trained two-dimensional coordinate recognition model, which is a stable model, is obtained.
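The relationship between a key point thermodynamic diagram, the predicted two-dimensional coordinate, and the error value can be illustrated with a small sketch. The application does not give the formula of the Gaussian transformation or of the error value, so two assumptions are made here: the diagram is an unnormalized two-dimensional Gaussian centred on a coordinate, and the error value is the Euclidean distance between the predicted coordinate and the sample label. The centre `(6, 3)` stands in for a hypothetical, not yet converged, model output; the parameter update itself is not shown.

```python
import math
from typing import List, Tuple

Heatmap = List[List[float]]


def gaussian_heatmap(width: int, height: int, center: Tuple[int, int],
                     sigma: float = 1.5) -> Heatmap:
    """Heat value of each pixel point: exp(-d^2 / (2 sigma^2)), d = distance to center."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def argmax_2d(heatmap: Heatmap) -> Tuple[int, int]:
    """(x, y) of the pixel point with the maximum heat value."""
    positions = ((x, y) for y, row in enumerate(heatmap) for x in range(len(row)))
    return max(positions, key=lambda p: heatmap[p[1]][p[0]])


def coordinate_error(predicted: Tuple[int, int], label: Tuple[int, int]) -> float:
    """Assumed error value: Euclidean distance in pixels."""
    return math.hypot(predicted[0] - label[0], predicted[1] - label[1])


label = (5, 3)                                   # pre-labeled real coordinate
heatmap = gaussian_heatmap(8, 6, center=(6, 3))  # hypothetical model output
predicted = argmax_2d(heatmap)
error = coordinate_error(predicted, label)       # drives the parameter adjustment
```

In an actual training loop this error value would be reduced over many samples by adjusting the model parameters; the sketch only shows how a single error value could arise.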

Step 310: for each human body key point, inputting the thermodynamic diagram of the human body key point and the image to be recognized into a trained depth detection model, determining key point features of the human body key point, and determining depth information of the human body key point based on the key point features.

The depth detection model is obtained through iterative training on image samples to be recognized, the corresponding thermodynamic diagram samples, and the real depth information of each human body key point in each image sample to be recognized, and the depth information indicates whether the human body key point lies in front of or behind a preset human body calibration point.

In the embodiment of the application, depth information of the human body key points is introduced in order to improve the accuracy of identifying their three-dimensional coordinates. Because the depth information represents the position of a human body key point relative to the calibration point, it constrains the three-dimensional coordinates that are subsequently determined. The steps for obtaining the depth information of the human body key points in the embodiment of the present application are described in detail below:

First, feature extraction is performed on the image to be recognized to obtain the image features of the image to be recognized, and feature extraction is performed on the key point thermodynamic diagram of the human body key point to obtain the thermodynamic diagram features of that key point thermodynamic diagram.

Then, the image features and the thermodynamic diagram features are combined to determine the key point features of the human body key point.

Finally, binary classification is performed on the depth of the human body key point based on the key point features to obtain the depth information of the human body key point relative to the human body calibration point.

The depth detection model may be, for example, a Residual Network (ResNet), which is not limited in this embodiment of the application.

It should be noted that the human body calibration point in the embodiment of the present application can be classified into the following three cases:

In the first case: the human body calibration point is a preset point.

The human body calibration point in the embodiment of the present application may be a preset point that is not included among the detected human body key points. Therefore, when the depth information of each human body key point relative to the human body calibration point is determined, the number of pieces of depth information obtained is the same as the number of human body key points.

For example, assuming that a human body has 17 human body key points in total, after the key point thermodynamic diagram corresponding to each human body key point is obtained, the 17 key point thermodynamic diagrams are input into the trained depth detection model, and binary classification is performed on the depths of the 17 human body key points to obtain the depth information of each of the 17 human body key points relative to the preset human body calibration point, that is, whether each human body key point is located on the front side or the rear side of the human body calibration point.

In the second case: the human body calibration point is one of the key points of the human body.

The human body calibration point in the embodiment of the application can be any one of the human body key points. Therefore, when the depth information of each human body key point other than the human body calibration point is determined relative to the human body calibration point, the number of pieces of depth information obtained is smaller than the number of human body key points.

For example, assuming that a human body has 17 human body key points in total and the human body calibration point is the vertex key point, after the key point thermodynamic diagram corresponding to each human body key point is obtained, the 17 key point thermodynamic diagrams are input into the trained depth detection model, and binary classification is performed on the depths of the human body key points to obtain the depth information of each human body key point other than the vertex key point relative to the vertex key point. The depth information obtained is therefore that of the other 16 human body key points relative to the vertex key point, and the number of pieces of depth information is 16.

In the third case: the human body calibration point is the hip key point.

In the embodiment of the application, in order to improve the accuracy of depth detection, a point that moves little and is relatively stable can be selected as the human body calibration point. The human body calibration point in the embodiment of the application can therefore also be the hip key point, which is the midpoint of the line connecting the left hip key point and the right hip key point. In this case, when the depth information of each human body key point other than the hip key point is determined relative to the hip key point, the number of pieces of depth information obtained is smaller than the number of human body key points.
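As a hedged illustration of how front/rear depth labels relative to a hip calibration point might be derived from annotated data, the following Python sketch computes the midpoint of the left and right hip key points and compares each key point's distance from the camera with that of the midpoint. The key point names, the sample values, the label encoding (1 for front, 0 for rear) and the convention that a smaller camera distance means "front" are assumptions, not details stated in the application.

```python
def midpoint(a, b):
    """Midpoint of the line connecting two points given as coordinate tuples."""
    return tuple((p + q) / 2.0 for p, q in zip(a, b))

def depth_labels(keypoints, calibration_point):
    """Label each key point 1 (front) or 0 (rear) relative to the calibration point.

    Each point is (x, y, camera_distance); 'front' is assumed to mean a smaller
    camera distance than the calibration point.
    """
    reference = calibration_point[2]
    return {name: 1 if point[2] < reference else 0
            for name, point in keypoints.items()}

# Hypothetical annotated key points: (x, y, distance from camera in metres).
keypoints = {
    "left_hip": (-0.1, 0.0, 2.0),
    "right_hip": (0.1, 0.0, 2.2),
    "left_wrist": (-0.3, 0.2, 1.8),
    "head_top": (0.0, -0.7, 2.3),
}
hip = midpoint(keypoints["left_hip"], keypoints["right_hip"])
labels = depth_labels(keypoints, hip)
```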

For example, assuming that a human body has 17 human body key points in total, after the key point thermodynamic diagram corresponding to each human body key point is obtained, the 17 key point thermodynamic diagrams are input into the trained depth detection model, and binary classification is performed on the depths of the human body key points to obtain the depth information of each human body key point other than the hip key point relative to the hip key point. The depth information obtained is therefore that of the other 16 human body key points relative to the hip key point, and the number of pieces of depth information is 16.

The training process of the depth detection model in the embodiment of the present application is explained below, and specifically includes:

S1: a second image sample set is acquired.

The second image sample set at least comprises a plurality of key point thermodynamic diagram samples, image samples to be identified and corresponding depth labels.

In the embodiment of the present application, a second image sample set is obtained. The second image sample set includes a plurality of sample groups, and each sample group includes a key point thermodynamic diagram sample, the image sample to be recognized corresponding to the key point thermodynamic diagram sample, and the depth label corresponding to the key point thermodynamic diagram sample.

It should be noted that the depth label is real depth information of the human body key points included in the key point thermodynamic diagram sample, and the depth label is pre-labeled.

S2: for each key point thermodynamic diagram sample, the key point thermodynamic diagram sample and the corresponding image sample to be recognized are input into an initial depth detection model, the predicted depth information of the human body key point contained in the key point thermodynamic diagram sample is determined, an error value between the predicted depth information and the depth label is calculated, and each parameter of the initial depth detection model is adjusted until the calculated error value is minimized, so as to obtain the trained depth detection model.

In the embodiment of the application, each key point thermodynamic diagram sample and its corresponding image sample to be recognized are input into the initial depth detection model and processed as follows: feature extraction is performed on the key point thermodynamic diagram sample to obtain its thermodynamic diagram features; feature extraction is performed on the image sample to be recognized to obtain its image features; the thermodynamic diagram features and the image features are combined to determine the key point features of the human body key point contained in the sample; the predicted depth information of the human body key point is determined according to the key point features; an error value between the predicted depth information and the depth label is calculated; and each parameter of the initial depth detection model is adjusted according to the calculated error value until the error value is minimized, so as to obtain the trained depth detection model.

When the depth detection model is trained, only the parameters of the depth detection model are updated, while the parameters of the two-dimensional coordinate recognition model are kept fixed. Specifically, the input is an RGB image and the key point thermodynamic diagram of a human body key point, and a binary classification prediction, namely whether the key point is in front of or behind the root node (the human body calibration point), is obtained through ResNet-18 or another CNN model. The error between the prediction and the label is then calculated with a binary cross-entropy loss function, and the network parameters are updated continuously by back propagation so that the error decreases until convergence, at which point a stable depth detection model is obtained.
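The loss and update described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a single linear layer over already-combined key point features is swapped in for the ResNet-18 backbone, and the feature values, labels, learning rate and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, feature):
    """Probability that the key point lies in front of the calibration point."""
    return sigmoid(sum(w * x for w, x in zip(weights, feature)) + bias)

def binary_cross_entropy(weights, bias, features, labels):
    """Mean binary cross-entropy between predictions and front/rear labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for feature, y in zip(features, labels):
        p = predict(weights, bias, feature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(features)

def train_step(weights, bias, features, labels, lr=0.5):
    """One gradient-descent update of the classifier parameters only."""
    n = len(features)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for feature, y in zip(features, labels):
        residual = predict(weights, bias, feature) - y
        for i, x in enumerate(feature):
            grad_w[i] += residual * x / n
        grad_b += residual / n
    return [w - lr * g for w, g in zip(weights, grad_w)], bias - lr * grad_b

# Hypothetical combined key point features and depth labels (1 = front, 0 = rear).
features = [[1.0, 0.2], [0.9, 0.1], [-1.0, 0.3], [-0.8, -0.2]]
labels = [1, 1, 0, 0]
weights, bias = [0.0, 0.0], 0.0
initial_loss = binary_cross_entropy(weights, bias, features, labels)
for _ in range(50):
    weights, bias = train_step(weights, bias, features, labels)
final_loss = binary_cross_entropy(weights, bias, features, labels)
```

Only the classifier parameters change in the loop, mirroring the statement that the two-dimensional coordinate recognition model stays fixed while the depth detection model is trained.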

Step 320: the three-dimensional coordinates of each human body key point are determined according to the depth information and the corresponding two-dimensional coordinates of each human body key point.

In the embodiment of the application, after the depth information and the corresponding two-dimensional coordinates of each human body key point are determined, the three-dimensional coordinates of each human body key point are determined according to the depth information and the corresponding two-dimensional coordinates of that human body key point.

Three ways of determining the three-dimensional coordinates of each human body key point in the embodiments of the present application are described in detail below; the embodiments are not limited to these three ways.

The first mode is as follows:

Specifically: for each human body key point, based on a trained coordinate conversion model, the two-dimensional coordinates and the depth information of the human body key point are taken as input parameters and combined to generate initial three-dimensional coordinates, and the three-dimensional coordinates of the human body key point relative to the human body calibration point are obtained through regression by the fully connected layers at each level.

The coordinate conversion model comprises at least multiple levels of fully connected layers.

In the embodiment of the application, for each human body key point, the two-dimensional coordinates and the depth information of the human body key point are input into the trained coordinate conversion model. The depth information is quantized to obtain quantized depth information; the x-axis and y-axis coordinates of the two-dimensional coordinates are used as the x-axis and y-axis coordinates of the three-dimensional coordinates of the human body key point, and the quantized depth information is used as the z-axis coordinate, giving the initial three-dimensional coordinates. The three-dimensional coordinates of the human body key point relative to the human body calibration point are then obtained through regression by the fully connected layers in the coordinate conversion model.

For example, assuming that the two-dimensional coordinates of the left-hand key point are (15,16) and its depth information indicates the front side of the human body calibration point, the depth information is quantized to 1. The x-axis coordinate 15 and the y-axis coordinate 16 of the two-dimensional coordinates are used as the x-axis and y-axis coordinates of the three-dimensional coordinates of the human body key point, and the quantized depth information 1 is used as the z-axis coordinate, giving the initial three-dimensional coordinates (15,16,1). The initial three-dimensional coordinates are then subjected to coordinate regression through the fully connected layers at each level to obtain the three-dimensional coordinates (15,16,5) of the human body key point relative to the human body calibration point.
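The worked example can be traced with the short Python sketch below. It is a sketch under assumptions rather than the trained model: the text only states that "front" is quantized to 1, so the value -1 for "rear" is assumed, and the two fully connected layers use hand-picked placeholder weights (no activation) chosen only so that the output reproduces the numbers in the example; a trained coordinate conversion model would learn its weights.

```python
def quantize_depth(side):
    """Map front/rear depth information to a number ('rear' -> -1.0 is an assumption)."""
    return {"front": 1.0, "rear": -1.0}[side]

def initial_3d(xy, side):
    """Use x and y as given and the quantized depth as the initial z."""
    return (float(xy[0]), float(xy[1]), quantize_depth(side))

def fully_connected(vector, weights, bias):
    """One fully connected layer: output = weights * vector + bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

# Placeholder layers (hypothetical weights, not trained values).
LAYERS = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 5.0]], [0.0, 0.0, 0.0]),
]

def regress(initial, layers):
    """Pass the initial coordinates through each fully connected layer in turn."""
    vector = list(initial)
    for weights, bias in layers:
        vector = fully_connected(vector, weights, bias)
    return vector

initial = initial_3d((15, 16), "front")
refined = regress(initial, LAYERS)
```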

The training steps of the coordinate conversion model in the embodiment of the present application are described in detail below, and specifically include:

S1: a third image sample set is acquired.

Wherein the third image sample set comprises at least a plurality of two-dimensional coordinate samples, corresponding depth information samples and coordinate labels.

S2: for each two-dimensional coordinate sample, the two-dimensional coordinate sample and the corresponding depth information sample are input into an initial coordinate conversion model, the predicted three-dimensional coordinates of the human body key point are determined, an error value between the predicted three-dimensional coordinates and the coordinate label is calculated, and each parameter of the initial coordinate conversion model is adjusted until the calculated error value is minimized, so as to obtain the trained coordinate conversion model. Fig. 4 is a schematic diagram of the model structure in the embodiment of the application.

When the coordinate conversion model is trained, only the parameters of the coordinate conversion model are updated, while the parameters of the two-dimensional coordinate recognition model and the parameters of the depth detection model are kept fixed.

For example, assuming that the human body calibration point is the left hip key point, the two-dimensional coordinate samples of the 16 key points other than the left hip key point and the depth information samples of these 16 key points relative to the left hip key point can be obtained through the two-dimensional coordinate recognition model and the depth detection model. The two-dimensional coordinate samples and the depth information samples of the 16 human body key points are then spliced to form a 16 × 3 matrix, the generated matrix is input into the initial coordinate conversion model, which may be, for example, a CNN or GCN model, and the predicted three-dimensional coordinates of the human body key points are output. An L2 error between the predicted three-dimensional coordinates and the coordinate label is then calculated, and each parameter of the initial coordinate conversion model is updated continuously by back propagation so that the error decreases until convergence, at which point a stable coordinate conversion model is obtained. Fig. 5 is a schematic diagram of coordinate conversion in the embodiment of the present application.
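The splicing and error computation in this example might look like the following Python sketch. The sample values are made up for illustration, and interpreting the "L2 error" as the mean Euclidean distance over key points is an assumption; the application does not give the exact formula.

```python
import math

def build_input_matrix(coords_2d, depths):
    """Splice N two-dimensional coordinates and N depth values into an N x 3 matrix."""
    return [[x, y, d] for (x, y), d in zip(coords_2d, depths)]

def l2_error(predicted, labels):
    """Mean Euclidean distance between predicted and labelled 3D coordinates (assumed definition)."""
    distances = [
        math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_row, label_row)))
        for pred_row, label_row in zip(predicted, labels)
    ]
    return sum(distances) / len(distances)

# Hypothetical samples for the 16 key points other than the calibration point.
coords_2d = [(float(i), float(2 * i)) for i in range(16)]
depths = [1.0 if i % 2 == 0 else -1.0 for i in range(16)]
matrix = build_input_matrix(coords_2d, depths)

example_error = l2_error([[0.0, 0.0, 0.0]], [[3.0, 4.0, 0.0]])
```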

The second mode is as follows:

Specifically: for each human body key point, the three-dimensional coordinates of the human body key point are determined according to the depth information and the two-dimensional coordinates of the human body key point and preset coordinate mapping information.

The coordinate mapping information represents the mapping relation between the depth information and two-dimensional coordinates of a human body key point and its three-dimensional coordinates.

In the embodiment of the application, each combination of depth information and two-dimensional coordinates of a human body key point corresponds to one set of three-dimensional coordinates, and different combinations correspond to different three-dimensional coordinates. A mapping relation therefore exists between the depth information and two-dimensional coordinates on the one hand and the three-dimensional coordinates on the other; this mapping relation is the coordinate mapping information, and the three-dimensional coordinates of a human body key point can be determined directly from its depth information and two-dimensional coordinates according to the mapping relation.

The coordinate mapping information may be, for example, a coordinate mapping table, where the coordinate mapping table is generated according to multiple sets of prior information obtained when determining the three-dimensional coordinates of the human body key points, and this is not limited in this embodiment of the application.
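A coordinate mapping table of this kind could be represented, in the simplest case, as a dictionary keyed by two-dimensional coordinates and depth information. The sketch below is illustrative only: the table entries are invented, the key layout is an assumption, and returning None for a missing entry is a design choice of this sketch rather than behaviour described in the application.

```python
# Hypothetical mapping table: (2D coordinates, depth information) -> 3D coordinates.
coordinate_mapping_table = {
    ((15, 16), "front"): (15.0, 16.0, 5.0),
    ((15, 16), "rear"): (15.0, 16.0, -5.0),
    ((20, 30), "front"): (20.0, 30.0, 3.0),
}

def lookup_3d(table, xy, side):
    """Return the mapped 3D coordinates, or None when no entry exists."""
    return table.get((tuple(xy), side))

found = lookup_3d(coordinate_mapping_table, (15, 16), "front")
missing = lookup_3d(coordinate_mapping_table, (1, 2), "rear")
```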

The third mode is as follows:

Specifically: the three-dimensional coordinates of each human body key point are determined according to the depth information and the corresponding two-dimensional coordinates of each human body key point and preset human body structure information.

In the embodiment of the application, the length of each limb of the human body can be obtained based on the two-dimensional coordinates of the human body key points, and these limb lengths constitute the human body structure information. After the limb lengths are obtained, the z-axis coordinate in the three-dimensional coordinates of each human body key point can be obtained from them. Finally, the two-dimensional coordinates, the depth information and the corresponding z-axis coordinate of each human body key point are combined to determine its three-dimensional coordinates.

For example, when the three-dimensional coordinates of the left elbow key point are calculated, the arm length of the human body can be determined from the two-dimensional coordinates of the left shoulder key point, the left wrist key point and the left elbow key point, and the three-dimensional coordinates of the left elbow key point can then be determined according to the determined arm length, the depth information of the left elbow key point and its two-dimensional coordinates.
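The application does not spell out how the z-axis coordinate follows from a limb length. One possible reading, shown below as a hedged Python sketch, treats the limb as a segment of known length whose projected length in the image is shorter, recovers the magnitude of the z offset with the Pythagorean relation, and takes the sign from the front/rear depth information. The orthographic-projection assumption, the use of the parent joint as the z reference, the "front is positive" sign convention and all names and values are assumptions of this sketch.

```python
import math

def relative_z(parent_xy, child_xy, limb_length, side):
    """Signed z offset of the child joint from the parent joint.

    Assumes orthographic projection, so limb_length^2 = dx^2 + dy^2 + dz^2.
    'front' is mapped to a positive offset and 'rear' to a negative one (assumed).
    """
    dx = child_xy[0] - parent_xy[0]
    dy = child_xy[1] - parent_xy[1]
    # Clamp at zero in case noise makes the projected length exceed the limb length.
    dz = math.sqrt(max(limb_length ** 2 - (dx * dx + dy * dy), 0.0))
    return dz if side == "front" else -dz

# Hypothetical values: shoulder at (0, 0), elbow at (3, 0), upper-arm length 5.
elbow_xy = (3.0, 0.0)
elbow_z = relative_z((0.0, 0.0), elbow_xy, limb_length=5.0, side="front")
elbow_3d = (elbow_xy[0], elbow_xy[1], elbow_z)
```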

Further, in the embodiment of the application, the two-dimensional coordinate recognition model, the depth detection model and the coordinate conversion model can each be determined through neural architecture search (NAS). Taking the two-dimensional coordinate recognition model as an example, a baseline model is designed manually according to the target speed of the two-dimensional coordinate recognition model and used as a template, the adjustable hyper-parameters of the baseline model are selected as the objects of the NAS search, an optimized structure of the baseline model is determined based on a preselected search strategy and the adjustable hyper-parameters, and the optimized structure is taken as the final two-dimensional coordinate recognition model.
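The search procedure can be pictured with the following Python sketch, which swaps in an exhaustive grid search as the "preselected search strategy". Everything concrete here is hypothetical: the hyper-parameter names, the candidate values, the latency target, and the stand-in latency and quality functions, which a real search would replace with measured speed and validation accuracy.

```python
import itertools

# Hypothetical baseline template and adjustable hyper-parameters.
baseline = {"depth": 18, "width": 64}
search_space = {"depth": [10, 18, 34], "width": [32, 64, 128]}
TARGET_LATENCY_MS = 11.0  # assumed target speed

def estimated_latency_ms(config):
    """Stand-in latency model; a real search would time the candidate network."""
    return config["depth"] * config["width"] / 100.0

def proxy_score(config):
    """Stand-in quality proxy; a real search would train and validate the candidate."""
    return config["depth"] * config["width"]

def grid_search(space, target_ms):
    """Return the best-scoring candidate that meets the latency target, or None."""
    names = sorted(space)
    best = None
    for values in itertools.product(*(space[name] for name in names)):
        candidate = dict(zip(names, values))
        if estimated_latency_ms(candidate) > target_ms:
            continue
        if best is None or proxy_score(candidate) > proxy_score(best):
            best = candidate
    return best

# Fall back to the hand-designed baseline if no candidate meets the target.
best_config = grid_search(search_space, TARGET_LATENCY_MS) or baseline
```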

In the embodiment of the application, key point thermodynamic diagrams, each containing a human body key point, and the two-dimensional coordinates of each human body key point are obtained by detecting an image to be recognized that contains a human body. For each human body key point, the key point thermodynamic diagram of the human body key point and the image to be recognized are input into a trained depth detection model, and the depth information of the human body key point is determined based on the key point features derived from the key point thermodynamic diagram and the image to be recognized. The three-dimensional coordinates of each human body key point are then determined according to the depth information and the corresponding two-dimensional coordinates of each human body key point, and the motion of the human body in the image to be recognized can be identified according to the determined three-dimensional coordinates. In this way, accurate results can be obtained from the two-dimensional coordinates and the depth information through simple fully connected layers, without increasing three-dimensional labeling resources.

Based on the same inventive concept, the embodiment of the present application provides an apparatus for estimating three-dimensional coordinates of human body key points, which may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the above embodiment, referring to fig. 6, which is a schematic structural diagram of the apparatus for estimating three-dimensional coordinates of human body key points in the embodiment of the present application, the apparatus specifically includes:

the first detection module 600 is configured to detect and obtain thermodynamic diagrams each including a human body key point and two-dimensional coordinates of each human body key point from an image to be recognized, where the image to be recognized includes a human body, and the human body includes each human body key point;

a second detection module 601, configured to input thermodynamic diagrams of any one human body key point and the image to be recognized into a trained depth detection model respectively for the human body key points, determine key point features of the human body key point, and determine depth information of the human body key point based on the key point features, where the depth detection model is obtained through iterative training according to each image sample to be recognized, corresponding thermodynamic diagram samples, and real depth information of each human body key point in each image sample to be recognized, and the depth information represents that the human body key point is located on a front side or a rear side of a preset human body calibration point;

a determining module 602, configured to determine a three-dimensional coordinate of each human body key point according to the depth information of each human body key point and the corresponding two-dimensional coordinate.

Optionally, the first detecting module 600 is specifically configured to:

a first obtaining module 603, configured to obtain a first image sample set, where the first image sample set includes each to-be-identified image sample and a corresponding sample label, and the sample label represents a real two-dimensional coordinate of each human body key point included in the to-be-identified image sample;

the first training module 604 is configured to input any one image sample to be recognized into an initial two-dimensional coordinate recognition model respectively for each image sample to be recognized, determine predicted two-dimensional coordinates of each human body key point in the image sample to be recognized, calculate error values between each predicted two-dimensional coordinate and a corresponding sample label respectively, and adjust each parameter of the initial two-dimensional coordinate recognition model until the calculated error value is minimized, so as to obtain the trained two-dimensional coordinate recognition model.

the second training module 605 is configured to input any one key point thermodynamic diagram sample and a corresponding image sample to be recognized into an initial depth detection model respectively for each key point thermodynamic diagram sample, determine predicted depth information of a human body key point included in the key point thermodynamic diagram sample, calculate an error value between the predicted depth information and a depth label, and adjust each parameter of the initial depth detection model until the calculated error value is minimized, so as to obtain the trained depth detection model.

Optionally, the determining module 602 is specifically configured to:

a second obtaining module 606, configured to obtain a third image sample set, where the third image sample set at least includes a plurality of two-dimensional coordinate samples, corresponding depth information samples, and coordinate labels;

a third training module 607, configured to input any one two-dimensional coordinate sample and a corresponding depth information sample into an initial coordinate conversion model respectively for each two-dimensional coordinate sample, determine a predicted three-dimensional coordinate of the human body key point, calculate an error value between the predicted three-dimensional coordinate and a coordinate label, and adjust each parameter of the initial coordinate conversion model until the calculated error value is minimized, so as to obtain the coordinate conversion model after training.

Optionally, the determining module 602 is specifically configured to:

Based on the above embodiments, referring to fig. 7, a schematic structural diagram of an electronic device in an embodiment of the present application is shown.

Embodiments of the present application provide an electronic device, which may include a processor 710 (Central Processing Unit, CPU), a memory 720, an input device 730, an output device 740, and the like. The input device 730 may include a keyboard, a mouse, a touch screen, and the like, and the output device 740 may include a display device such as a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT).

Memory 720 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides processor 710 with program instructions and data stored in memory 720. In the embodiment of the present application, the memory 720 may be used to store a program of any one of the methods for estimating three-dimensional coordinates of human key points in the embodiment of the present application.

The processor 710 is configured to call the program instructions stored in the memory 720 and to execute, according to the obtained program instructions, any one of the methods for estimating three-dimensional coordinates of human body key points in the embodiments of the present application.

Based on the above embodiments, in the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the method for estimating three-dimensional coordinates of human body key points in any of the above method embodiments.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for estimating three-dimensional coordinates of key points of a human body is characterized by comprising the following steps:

2. The method of claim 1, wherein detecting and obtaining thermodynamic diagrams each including human body key points and two-dimensional coordinates of each human body key point from the image to be recognized specifically comprises:

3. The method of claim 2, wherein the two-dimensional coordinate recognition model is trained by:

4. The method of claim 1, wherein the depth detection model is trained by:

5. The method according to claim 1, wherein determining the three-dimensional coordinates of the body keypoints according to the depth information and the corresponding two-dimensional coordinates of the body keypoints specifically comprises:

6. The method of claim 5, wherein the coordinate transformation model is trained by:

7. The method according to claim 1, wherein determining the three-dimensional coordinates of the body keypoints according to the depth information and the corresponding two-dimensional coordinates of the body keypoints specifically comprises:

8. The method according to claim 1, wherein determining the three-dimensional coordinates of the body keypoints according to the depth information and the corresponding two-dimensional coordinates of the body keypoints specifically comprises:

9. An apparatus for estimating three-dimensional coordinates of key points of a human body, comprising:

10. The apparatus of claim 9, wherein the first detection module is specifically configured to:

11. The apparatus of claim 10, wherein when training the two-dimensional coordinate recognition model, further comprises:

12. The apparatus of claim 10, wherein in training the depth detection model, further comprises:

13. The apparatus of claim 10, wherein the determination module is specifically configured to:

14. The apparatus of claim 13, wherein when training the coordinate transformation model, further comprises:

15. The apparatus of claim 10, wherein the determination module is specifically configured to:

16. The apparatus of claim 10, wherein the determination module is specifically configured to:

17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1-8 are implemented when the program is executed by the processor.

18. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 8.