CN111626220A

CN111626220A - Method, device, medium and equipment for estimating three-dimensional postures of multiple persons

Info

Publication number: CN111626220A
Application number: CN202010467565.6A
Authority: CN
Inventors: 袁潮; 温建伟; 方璐; 李广涵; 赵月峰
Original assignee: Beijing Zhuohe Technology Co Ltd
Current assignee: Shenzhen Zhuohe Technology Co ltd
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2020-09-04

Abstract

The invention discloses a method, a device, a medium and equipment for estimating a three-dimensional posture of multiple persons, wherein the method comprises the steps of acquiring two-dimensional coordinates of multiple nodes on each of multiple persons to be processed in an image; determining the depth of each human body in a corresponding scene in the image and the depth confidence of each human body in the corresponding scene in the image according to the two-dimensional coordinates of the nodes of each human body; and adjusting the depths of a plurality of human bodies in the corresponding scenes in the image according to the depth confidence. The method can acquire the depth of each person in the multi-person image in the same scene, can identify the postures of the persons in the image, has better three-dimensional posture estimation efficiency, can optimize the depth through the depth confidence coefficient, and improves the estimation accuracy.

Description

Method, device, medium and equipment for estimating three-dimensional postures of multiple persons

Technical Field

The invention relates to the technical field of graphic processing, in particular to a method, a device, a medium and equipment for estimating a three-dimensional posture of multiple persons.

Background

With the development of information technology and the popularization of intelligent technology, human body posture recognition technology has started to be widely applied in computer vision related fields, such as the fields of human-computer interaction, movie and television production, motion analysis, game entertainment and the like, and the human body posture recognition technology is researched and analyzed according to human body posture recognition in a fixed scene. The human body posture recognition technology in the prior art generally only recognizes single three-dimensional posture points, and the effect of the human body posture recognition technology on the three-dimensional posture of multiple people is poor.

Disclosure of Invention

In order to solve the problem of low face recognition efficiency in the prior art, a method, a device, a medium and equipment for estimating the three-dimensional posture of multiple persons are provided.

According to a first aspect of the present invention, there is provided a method for estimating a three-dimensional pose of a plurality of persons, comprising the steps of:

acquiring two-dimensional coordinates of a plurality of nodes on each human body of a plurality of human bodies to be processed in an image;

determining the depth of each human body in a corresponding scene in the image and the depth confidence of each human body in the corresponding scene in the image according to the two-dimensional coordinates of the nodes of each human body;

and adjusting the depths of a plurality of human bodies in the corresponding scenes in the image according to the depth confidence.

Further, the adjusting the depths of the plurality of human bodies in the corresponding scenes in the image according to the depth confidence includes:

selecting one of depth confidence coefficients of a plurality of human bodies in corresponding scenes in the image as a reference confidence coefficient according to a preset rule;

determining a depth adjustment value of each human body corresponding to the non-reference confidence coefficient in a corresponding scene in the image according to the characteristic vector of the human body multiple nodes corresponding to the reference confidence coefficient and the characteristic vector of each human body multiple node corresponding to the non-reference confidence coefficient;

and adjusting the depth of each human body in the corresponding scene in the image corresponding to the non-reference confidence degree according to the depth adjustment value of each human body in the corresponding scene in the image.

Further, after determining a depth confidence of each human in the corresponding scene in the image, the method further comprises: optimizing the depth confidence:

acquiring the theoretical depth of each human body;

determining the theoretical confidence of each human body in the corresponding scene in the image according to the theoretical depth and the depth of each human body in the corresponding scene in the image;

and optimizing the depth confidence of each human body in the corresponding scene in the image according to the theoretical confidence of each human body in the corresponding scene in the image to obtain the depth confidence of each optimized human body in the corresponding scene in the image.

and adjusting the depth of each human body in the corresponding scene in the image according to the depth confidence of each human body in the corresponding scene in the image after optimization, so as to obtain the depth of the human bodies in the corresponding scene in the image.

Further, the method further comprises:

determining a multi-person multi-node operation matrix according to the two-dimensional coordinates of a plurality of nodes on each human body, and acquiring a multi-person multi-node eigenvector matrix according to the operation matrix; wherein each node of each human body corresponds to one eigenvector in the eigenvector matrix;

determining a three-dimensional coordinate corresponding to each node of each human body according to the characteristic vector corresponding to each node of each human body;

determining the depth of each human body in the corresponding scene in the image and the depth confidence of each human body in the corresponding scene in the image according to the two-dimensional coordinates of the plurality of nodes of each human body, including:

determining the depth of each human body in a corresponding scene in the image according to the feature vector corresponding to each node of each human body;

and determining the depth confidence of each human body in the corresponding scene in the image according to the feature vector corresponding to each node of each human body.

According to a second aspect of the present invention, there is provided an apparatus for estimating a three-dimensional pose of a plurality of persons, comprising:

the first neural network module is used for acquiring two-dimensional coordinates of a plurality of nodes on each human body of a plurality of human bodies to be processed in the image;

the first determining module is used for determining the depth of each human body in a corresponding scene in the image according to the two-dimensional coordinates of the nodes of each human body;

a second determining module, configured to determine, according to the two-dimensional coordinates of the plurality of nodes of each human body, a depth confidence of each human body in a corresponding scene in the image

And the depth optimization module is used for adjusting the depths of a plurality of human bodies in corresponding scenes in the image according to the depth confidence coefficient.

Further, the depth optimization module is specifically configured to:

Further, the apparatus further comprises a confidence optimization module to:

acquiring the theoretical depth of each human body;

According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed, implements any of the above-described methods for estimating a three-dimensional pose of a plurality of persons.

According to a fourth aspect of the present invention, there is provided a computer device comprising a processor, a memory, and a computer program stored on the memory, wherein the processor implements the method for estimating the three-dimensional pose of the plurality of persons according to any one of the above methods when executing the computer program.

The method, the device, the medium and the equipment for estimating the three-dimensional postures of the multiple persons have the following technical effects: the depth of each person in the multi-person image in the same scene can be obtained, the depth is optimized and adjusted, recognition of the three-dimensional postures of the plurality of persons in the image is achieved, the three-dimensional posture estimation efficiency is better, the depth can be optimized through the confidence coefficient of the depth, and the accuracy of depth estimation is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of a method provided by an exemplary embodiment of the present invention;

FIG. 2 is a flow chart of a method provided by another exemplary embodiment of the present invention;

FIG. 3 is a flow chart of a method provided by another exemplary embodiment of the present invention;

FIG. 4 is a flow chart of a method provided by another exemplary embodiment of the present invention;

FIG. 5 is a diagram of an operational matrix provided by an exemplary embodiment of the present invention;

FIG. 6 is a schematic diagram of human body node selection according to an exemplary embodiment of the present invention;

FIG. 7 is a schematic diagram of an apparatus provided by an exemplary embodiment of the present invention;

FIG. 8 is a schematic view of an apparatus provided by another exemplary embodiment of the present invention;

FIG. 9 is a schematic illustration of an apparatus provided in accordance with another exemplary embodiment of the present invention;

fig. 10 is a flow chart provided by an exemplary embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

As shown in fig. 1, a method for estimating a three-dimensional pose of multiple persons according to an embodiment of the present invention includes the following steps:

s101, acquiring two-dimensional coordinates of a plurality of nodes on each human body of a plurality of human bodies to be processed in the image.

In step S101, a color image including a plurality of persons is input to the first neural network module 600 (e.g., openpos neural network structure), and a multi-node two-dimensional coordinate matrix of a plurality of human bodies (each corresponding to two-dimensional coordinates of a plurality of nodes) is obtained. The number of nodes (joints) selected from each human body can be set according to actual needs, for example, each person obtains 2-17 nodes.

For example, the input image is an image containing 2 persons, each person selects three nodes, and the two-dimensional coordinates (two-dimensional pose points) of each node of each person can be obtained once through step S101, for example, the two-dimensional coordinate of the first node on the first person is (x)_1-1,y_1-1) The two-dimensional coordinate of the second node on the first human body is (x)_1-2,y_1-2) The two-dimensional coordinate of the third node on the first human body is (x)_1-3,y_1-3) The two-dimensional coordinate of the first node on the second human body is (x)_2-1,y_2-1) The two-dimensional coordinate of the second node on the first human body is (x)_2-2,y_2-2) The two-dimensional coordinate of the third node on the first human body is (x)_2-3,y_2-3)。

S102, according to the two-dimensional coordinates of the nodes of each human body, the depth of each human body in the corresponding scene in the image and the depth confidence of each human body in the corresponding scene in the image are determined.

In step S102, two-dimensional coordinates of each node of each person in the image (for example, a matrix formed by multi-person multi-node two-dimensional coordinates) are input into a second neural network module 700 (for example, a GCN graph convolution network structure, such as SemGCN) based on deep learning, a multi-person multi-node operation matrix in the GCN is determined according to the input two-dimensional coordinates, and a feature vector matrix of the multi-person is obtained according to the operation matrix; wherein each node of each human body corresponds to one eigenvector in the eigenvector matrix (e.g., a row vector of the matrix).

The feature vector matrix of multiple nodes of multiple persons is input to the first determining module 701 (which may be set as a first linear regression layer), and linear regression is performed, so that the feature vector corresponding to each node of each human body can be weighted and averaged, and the feature vector corresponding to each human body is obtained, thereby determining the depth of each human body in the corresponding scene in the image. The feature vector matrix of multiple nodes of multiple persons is input to the second determining module 702 (which may be set as a second linear regression layer), and linear regression is performed, so that the feature vector corresponding to each node of each human body can be weighted and averaged, and the feature vector corresponding to each human body is obtained, thereby determining the depth confidence of each human body in the corresponding scene in the image. In this embodiment, the determined depth and the depth confidence are both obtained by neural network estimation. Wherein, the two linear regression layers of the first determining module 701 and the second determining module 702 have different parameters.

Wherein, the depth of each human body in the corresponding scene in the image can be understood as the position of each human body in the scene or the distance between the human body in the image and the camera plane. It should be noted that the depth of each human body in the corresponding scene in the image obtained in this embodiment is substantially a root node depth estimated by the second neural network module 700 and the first determining module 701 through the network. In this embodiment, each person corresponds to a root node depth, and the root node depth that each person corresponds can be: the obtained depth of a root node is calculated by a linear regression weighted average (the first determining module 701) according to the feature vector corresponding to each node on the human body (for example, when the hip is taken as the root node, the calculated depth of the root node represents the depth of the hip of the person in the corresponding scene in the image). For example, when the input image is an image including 2 persons, a multi-person multi-node two-dimensional coordinate matrix is obtained through step S101; the depth of each person (e.g. the depth of the first person is d) can be obtained through step S102₁The depth of the second human body is d₂)。

In the human body posture estimation method in the related art, the depth is not always inspected or the reliability of obtaining the depth is not high. In step S102, a confidence may be further estimated corresponding to the depth, and the confidence may be understood as a probability that the estimated depth is correct, wherein a higher confidence indicates a smaller error of the obtained depth. The depth confidence of each human body in the corresponding scene in the image is calculated by a linear regression weighted average (the second determining module 702) according to the feature vectors corresponding to the multiple nodes of the human body.

Thus, each person corresponds to a depth in the corresponding scene in the image and a depth confidence in the corresponding scene in the image.

S103, adjusting the depth of the human bodies in the corresponding scene in the image according to the depth confidence coefficient.

And the depth is optimized according to the depth confidence coefficient, so that the reliability of the obtained depth is higher, and the accuracy of depth estimation is improved. For example, when the input image is an image including 2 persons, the depth of each person in the corresponding scene in the image and the depth confidence of each person in the corresponding scene in the image are obtained through step S102, and then the depth value of each person can be adjusted and optimized according to the depth confidence of each person through step S103, so as to obtain the depth with the depth confidence at a higher level.

The multi-person three-dimensional attitude estimation method provided by the embodiment of the invention can acquire the depth of each person in the same scene at one time, optimize and adjust the depth, realize the recognition of the multi-person attitude in the image, has better three-dimensional attitude estimation efficiency, can optimize the depth through the confidence coefficient of the depth, and improves the estimation accuracy.

In another exemplary embodiment, as shown in FIG. 2, the method includes the steps of:

s201, acquiring two-dimensional coordinates of a plurality of nodes on each human body of a plurality of human bodies to be processed in the image. See step S101 for details.

S202, according to the two-dimensional coordinates of the nodes of each human body, the depth of each human body in the corresponding scene in the image and the depth confidence of each human body in the corresponding scene in the image are determined. See step S102 for details.

S203, selecting one of the depth confidence coefficients of the human bodies in the corresponding scene in the image as a reference confidence coefficient according to a preset rule.

The reference confidence coefficient is selected in the following manner:

and comparing the depth confidence of each human body in the corresponding scene in the image, and selecting the depth confidence as a reference confidence. For example, in comparison between any two depth confidences, the higher depth confidence is the reference confidence, and the lower depth confidence is regarded as the non-reference confidence. Or the highest depth confidence in the historical data is taken as the reference confidence, and the confidence lower than the reference confidence can be regarded as the non-reference confidence.

For example, the depth confidence for each person of the output may be ranked in order from high to low. Specifically, when the input image is an image including 2 persons, the depth of the first person is d₁The depth of the second human body is d₂Confidence of depth of the first human body is C₁Confidence of depth of the second human body is C₂Suppose C in this example₁＞C₂. At this time, C can be selected₁As a reference confidence.

S204, determining a depth adjustment value of each human body corresponding to the non-reference confidence degree in a scene corresponding to the image according to the characteristic vectors of the human body multiple nodes corresponding to the reference confidence degree and the characteristic vectors of each human body multiple nodes corresponding to the non-reference confidence degree.

Inputting two-dimensional coordinates (such as a matrix formed by multi-person multi-node two-dimensional coordinates) of each node of each person in the image into a second neural network module 700 (such as a GCN graph convolution network structure, such as SemGCN) based on deep learning, determining a multi-person multi-node operation matrix in the GCN according to the input two-dimensional coordinates, and acquiring a multi-person multi-node feature vector matrix according to the operation matrix; wherein each node of each human corresponds to one eigenvector in the eigenvector matrix.

In step S204, the feature vectors of each node of the human body corresponding to the reference confidence and the feature vectors of each node of each human body corresponding to the non-reference confidence are subtracted or stacked together and input to the depth optimization module 703 (including a third linear regression layer), and the depth adjustment value is obtained through linear regression by the depth optimization module 703.

For example, when the input image is an image containing 2 persons, the depth d of the first person in the corresponding scene in the image, which is determined by the first determining module 701, is obtained₁Depth d of the second person in the corresponding scene in the image₂. Obtaining the depth confidence C of the first human body in the corresponding scene in the image determined by the second determining module 702₁Depth confidence C of the second human in the corresponding scene in the image₂. Suppose C in this example₁＞C₂Then use d₁Optimization of d₂Subtracting or stacking the eigenvector matrix of each node of the first human body and the eigenvector matrix of each node of the second human body, inputting the subtracted or stacked eigenvector matrices into a depth optimization module 703, and obtaining d by regression of the depth optimization module 703₁And d₂Depth adjustment value (depth difference).

S205, adjusting the depth of each human body in the corresponding scene in the image corresponding to the non-reference confidence degree according to the depth adjustment value of each human body in the corresponding scene in the image.

For example, according to d in connection with the example of step S204₁And d₂Determining the depth d of the adjusted and optimized second human body in the corresponding scene in the image according to the determined depth adjustment value₂’＝d₁+ depth difference, the output result of this step includes: depth d corresponding to the first human body with higher depth confidence₁And adjusting the optimized depth d corresponding to the second human body₂'. The optimization mode can be used for mining the information of the feature vector corresponding to the depth with high depth confidence coefficient, gaining the feature vector corresponding to the low confidence coefficient, and finally achieving the purpose of gaining the accuracy of the depth with low confidence coefficient.

In another exemplary embodiment, as shown in FIG. 3, the method includes the steps of:

s301, acquiring two-dimensional coordinates of a plurality of nodes on each human body of a plurality of human bodies to be processed in the image. See S101 for details.

S302, according to the two-dimensional coordinates of the nodes of each human body, the depth of each human body in the corresponding scene in the image and the depth confidence of each human body in the corresponding scene in the image are determined. See S102 for details.

And S303, optimizing the depth confidence.

In this step, the determined depth confidence is trained and optimized, and the following optimization method can be specifically adopted:

s3031, obtaining the theoretical depth of each human body.

In the process of determining the depth of each human body in the corresponding scene in the image, the gesture semantic information contained in the input two-dimensional coordinates is equivalently utilized, the theoretical depth of each human body in the image can be theoretically obtained according to the gesture semantic information and the size information, and for example, the theoretical depth corresponding to the two-dimensional coordinates can be inquired in a standard database. The size information refers to information such as the actual height of a person, or the size of a pixel occupied by each person in the fixed image; for example, in an image, if a first human body occupies half of the pixels and the real height is short, the depth of the first human body in the image may be small (the distance from the camera is short), and if a second human body occupies half of the pixels and the real height is high, the depth of the second human body in the image may be large (the distance from the camera is long).

S3032, according to the theoretical depth and the depth of each human body in the corresponding scene in the image, determining the theoretical confidence of each human body in the corresponding scene in the image.

Inquiring standard data to obtain theoretical depth corresponding to each person

Then, at the theoretical depth

For mean, custom variance (e.g., can be 30), establish a one-dimensional gaussian distribution with respect to depth:

(d) wherein,

the abscissa of the gaussian distribution function is depth, and the ordinate is theoretical confidence corresponding to the depth.

d is the depth of each human body in the corresponding scene in the image determined according to the input two-dimensional coordinates (the estimated depth determined by the first determining module 701 can be directly obtained). And substituting the depth of each person in the corresponding scene in the image into the Gaussian distribution, wherein the obtained result C is the theoretical confidence corresponding to each person. Therefore, the theoretical confidence of the depth of each human body estimated by the network in the corresponding scene in the image can be determined through the gaussian distribution and the depth of each human body estimated by the first determining module 701 in the corresponding scene in the image. In this way, the depth confidence determined by the second determining module 702 can be trained and optimized according to the theoretical confidence, so that the confidence of the confidence network estimation is more accurate. The confidence corresponding to the theoretical depth is considered to be 1, and the confidence calculated by the Gaussian distribution is gradually decreased towards two sides in a Gaussian distribution manner.

S3033, training and optimizing the depth confidence coefficient of each human body in the corresponding scene in the image according to the theoretical confidence coefficient of each human body in the corresponding scene in the image, and obtaining the depth confidence coefficient of each optimized human body in the corresponding scene in the image.

The two-dimensional coordinates are input into a second neural network module 700 (such as a GCN graph convolution network structure, for example, SemGCN) based on deep learning, a multi-person multi-node operation matrix in the GCN is determined according to the two-dimensional coordinates of a plurality of nodes on each human body, and a multi-person multi-node eigenvector matrix is obtained according to the operation matrix, wherein each node of each human body corresponds to one eigenvector in the eigenvector matrix.

The feature vector matrices corresponding to the multiple nodes of the person are input to the second determining module 702 (including the second linear regression layer and the Sigmoid function module), and the depth confidence of each human body in the corresponding scene in the image is obtained through the linear regression weighted average.

After a set of "depth-theoretical confidence" data pairs of the human body in the corresponding scene in the image is obtained in step S3032, in step S3033, the depth confidence is optimized through theoretical confidence training. For example, a theoretical confidence C and a depth confidence C of a scene corresponding to a human body in an image are established_iAccording to the depth confidence C_iThe confidence loss for each person is calculated separately from the theoretical confidence C: loss ═ C (C-C)_i)²。

For each human bodyThe theoretical confidence C in the corresponding scene in the image is taken as the depth confidence C of each human body in the corresponding scene in the image_iAnd (4) training a network according to the actual value corresponding to the task. In the training process, the depth-theoretical confidence coefficient data pair is used as a training sample, and the loss is trained (to be smaller and smaller), so that the depth confidence coefficient of each human body in the corresponding scene of the image, corresponding to the input two-dimensional coordinates, can be obtained more accurately.

S304, according to the depth confidence of each optimized human body in the corresponding scene in the image, adjusting the depth of the corresponding human body in the corresponding scene in the image, and obtaining the depth of the multiple human bodies in the corresponding scene in the image. Reference may be made in particular to the corresponding steps of the embodiment of fig. 2.

In this embodiment, the second determining module 702 for determining the depth confidence of each human body in the corresponding scene in the image is disposed after the second neural network module 700, and the depth confidence may also represent the confidence of the feature vector output by the previous network outputting the depth, that is, the second neural network module 700. In the optimization training of the depth confidence, trainable parameters involved in the process of acquiring data by the first determining module 701 are temporarily frozen, and only the depth confidence network of the second determining module 702 is trained, so as to mine more accurate depth confidence. The depth confidence coefficient can also express the robustness of the features corresponding to the estimated depth, and high confidence coefficient means the accuracy of depth estimation.

In another exemplary embodiment, as shown in FIG. 4, the method includes the steps of:

step S401, acquiring two-dimensional coordinates of a plurality of nodes on each human body of a plurality of human bodies to be processed in the image. See step S101 for details.

S402, determining a multi-person multi-node operation matrix according to two-dimensional coordinates of a plurality of nodes on each human body, and acquiring a characteristic vector matrix of a plurality of persons according to the operation matrix; wherein each node of each human corresponds to one eigenvector in the eigenvector matrix.

The two-dimensional coordinates (which may be a matrix formed by multi-person multi-node two-dimensional coordinates) of each node of each person in the image are input into a second neural network module 700 (such as a GCN graph convolution network structure, for example, SemGCN) based on deep learning, a multi-person multi-node operation matrix in the GCN is determined, and a multi-person multi-node feature vector matrix is obtained according to the operation matrix.

And S403, determining a three-dimensional coordinate corresponding to each node of each human body according to the feature vector corresponding to each node of each human body.

In step S403, a matrix formed by the feature vectors corresponding to the multiple nodes is input to the third determining module 704 (which may be set as a fourth linear regression layer), and linear regression is performed, so that three-dimensional coordinates corresponding to each node of each human body can be determined according to the feature vector corresponding to each node of each human body. For example, when the input image is an image containing 2 persons, a node two-dimensional coordinate matrix of a plurality of persons is obtained; through step S403, a three-dimensional coordinate matrix corresponding to multiple nodes of multiple persons can be obtained (thereby obtaining coordinates of each three-dimensional posture point of each person).

S404, determining the depth of each human body in a corresponding scene in the image according to the feature vector corresponding to each node of each human body; and determining the depth confidence of each human body in the corresponding scene in the image according to the feature vector corresponding to each node of each human body.

The eigenvector matrix corresponding to multiple nodes of multiple persons is input to the first determining module 701, and the depth eigenvector or the matrix formed by the depth values corresponding to multiple persons is obtained through linear regression weighted average. Here, a matrix of N × 1 may be obtained through two linear regressions, for example, a first linear regression based on a feature vector matrix corresponding to multiple nodes of multiple persons, where N may be understood as the number of human bodies in an image × the number of nodes selected per human body (for example, N ═ 6 in fig. 5); and then the second linear regression is converted into a matrix of a 1, wherein a represents the number of people (for example, a is 2 in fig. 5). In the resulting matrix, each person corresponds to a depth feature vector or depth value.

Similarly, the feature vector matrix corresponding to multiple nodes of multiple persons is input to the second determining module 702, and the depth confidence feature vector corresponding to multiple persons or the matrix formed by the depth confidence values is obtained through two times of linear regression weighted averaging.

And S405, adjusting the depths of a plurality of human bodies in corresponding scenes in the image according to the depth confidence. See in particular the method in the corresponding embodiment of fig. 2.

In step S402, the two-dimensional coordinates of multiple nodes obtained in step S401 are input to the second neural network module 700 based on deep learning, and a multiple-node operation matrix in the second neural network module 700 is determined, where the operation matrix (a) ═ connection matrix (B) × two-dimensional coordinate matrix (C), where the connection matrix may represent weight parameters of a node (such as parameters of diagonal positions of the connection matrix in fig. 5) and connection weight parameters of the node (such as shown in fig. 6, which is a schematic diagram of selection of a human body node, where nodes connected or adjacent to a node a have B, C, and d, nodes connected or adjacent between nodes have connection weights involved in network calculation), and the weight parameters may be optimized through learning; the dimension of the connection matrix is N x N. The two-dimensional coordinate matrix is an input two-dimensional coordinate matrix containing multiple nodes of multiple persons. The result of the operation matrix is a feature matrix formed by the feature vectors of a plurality of persons.

In this embodiment, for an image including multiple persons, two-dimensional coordinates of nodes corresponding to the multiple persons are simultaneously input to the second neural network module 700, and a feature matrix formed by feature vectors of the multiple persons can be obtained through one-time operation matrix operation, thereby greatly improving the speed.

For example, in a specific example, when the input image is an image including 2 persons, each person selects 3 nodes, N is equal to 6, the dimension of the connection matrix B is 6 × 6, the dimension of the two-dimensional coordinate matrix C is 6 × 2, and the dimension of the feature matrix a is 6 × 2. As shown in fig. 5, each row of the connection matrix from top to bottom is: first node J of first person_1-1A first node J of a second person_2-1Second node J of the first person_1-2A second node J of a second person_2-2… …, each column of the connection matrix from left to right is: first node J of first person_1-1A first node J of a second person_2-1The first stepSecond node J of one person_1-2A second node J of a second person_2-2… … connection matrix_ijAnd indicates whether there is a connection or adjacency between two nodes. When A is_ijW represents that the node in the ith row in the connection matrix is connected or adjacent to the node in the jth column, for example, w except for the diagonal positions in the connection matrix in fig. 5 expresses the weight parameter between adjacent or connected nodes; or A_ijExpressing a weight parameter of a certain node as w; the parameter w connecting diagonal positions of the matrix in fig. 5 expresses the weight parameter of the corresponding node. Where w is a weight parameter that can be optimized by training, which is only illustrated in fig. 5, and the values of w may be substantially different. When A is_ij0 means that the nodes in i rows are not connected to the nodes in j columns in the connection matrix. The two-dimensional coordinate matrix C is a corresponding two-dimensional coordinate of the node. In this embodiment, each human body node in the feature vector matrix a corresponds to a feature vector, for example, as shown in fig. 5, a_1-1、b_1-1The row represents a feature vector (here, a two-dimensional vector is taken as an example) corresponding to a first node of the first human body, a_2-1、b_2-1The row represents the feature vector corresponding to the first node of the second human body.

For another example, in other embodiments, the dimension of the feature matrix is N × M, where N is the number of human bodies in the image and the number of nodes selected for each human body, and M may be a custom dimension, such as setting M to 256 dimensions.

In this embodiment, the second neural network module 700 further includes a convolution submodule 7001 and a plurality of loop submodules 7002, and as shown in fig. 9, the input of each loop submodule 7002 includes two parts: i.e. the output of the previous sub-module and the input of the previous sub-module. In this embodiment, before each graph convolution operation, the dimension (value of M) of the feature matrix input this time needs to be converted, and the dimension may be increased, decreased or increased. Generally, the more dimensions, the better the accuracy, but the slower the operation speed. The dimension conversion can be performed as follows: for each person, the feature matrix operation, for example, in the convolution sub-module, represents the feature matrix dimension corresponding to a person as: (N/person count) × M1, converting (N/person count) × M1 into a matrix of (N/person count) × M2 by using a linear regression layer, converting the characteristics of each person, merging the matrices of N persons to form a total characteristic matrix, and performing the next graph convolution operation. Wherein, M1 and M2 can be both customized numerical values.

In the above embodiment, the first determining module 701 (which may be set as a first linear regression layer) for determining the depth confidence of each human body in the corresponding scene in the image, the second determining module 702 (which may be set as a second linear regression layer) for determining the depth confidence of each human body in the corresponding scene in the image, the depth optimizing module 703 (which may be set as a third linear regression layer) for adjusting the depth, and the third determining module 704 (which may be set as a fourth linear regression layer) for determining the three-dimensional coordinates are set in parallel after the second neural network module 700, and the input data of the four linear regression modules set in parallel are all obtained from the output data (multi-person multi-node feature vector matrix) after graph convolution based on the two-dimensional coordinates. For example, as shown in fig. 10, after the two-dimensional coordinates are input, the depth of each human body in the corresponding scene in the image, the depth confidence of each human body in the corresponding scene in the image, and the three-dimensional coordinates of each node of each human body may be obtained, where the depth may be optimized and adjusted by a depth optimization module 703, and the depth confidence may be adjusted by a confidence optimization module 705.

It is worth noting that the parameters of several linear regression layers arranged behind the second neural network module 700 are different. And data may be transmitted between the modules, for example, the second determining module 702 may obtain the depth data of the first determining module 701, and the depth optimizing module 703 may obtain the depth data of the first determining module 701 and the depth confidence data of the second determining module 702.

Another embodiment of the present invention provides an apparatus for estimating a three-dimensional pose of multiple persons, as shown in fig. 7, the apparatus in this embodiment includes: the first neural network module 600, the first determining module 701, the second determining module 702, and the depth optimizing module 703 further include a second neural network module 700, and the apparatus of this embodiment is used to implement the method shown in fig. 1. The first neural network module 600 is configured to obtain two-dimensional coordinates of a plurality of nodes on each of a plurality of human bodies to be processed in an image. A first determining module 701, configured to determine, according to the two-dimensional coordinates of the multiple nodes of each human body, a depth of each human body in a corresponding scene in the image. A second determining module 702, configured to determine, according to the two-dimensional coordinates of the multiple nodes of each human body, a depth confidence of each human body in a corresponding scene in the image. And a depth optimization module 703, configured to adjust depths of multiple human bodies in a corresponding scene in the image according to the depth confidence.

In another embodiment, as shown in fig. 7, the apparatus in this embodiment comprises: the first neural network module 600, the first determining module 701, the second determining module 702, and the depth optimizing module 703 further include a second neural network module 700, and the apparatus of this embodiment is used to implement the method shown in fig. 2. The depth optimization module 703 is specifically configured to: selecting one of depth confidence coefficients of a plurality of human bodies in corresponding scenes in the image as a reference confidence coefficient according to a preset rule; determining a depth adjustment value of each human body corresponding to the non-reference confidence coefficient in a corresponding scene in the image according to the characteristic vector of the human body multiple nodes corresponding to the reference confidence coefficient and the characteristic vector of each human body multiple node corresponding to the non-reference confidence coefficient; and adjusting the depth of each human body in the corresponding scene in the image corresponding to the non-reference confidence degree according to the depth adjustment value of each human body in the corresponding scene in the image.

In another embodiment, as shown in fig. 8, the apparatus in this embodiment comprises: the first neural network module 600, the first determining module 701, the second determining module 702, and the depth optimizing module 703 further include a second neural network module 700 and a confidence optimizing module 705, and the apparatus of the present embodiment is used to implement the method shown in fig. 3. The confidence optimization module 705 is configured to: acquiring the theoretical depth of each human body; determining the theoretical confidence of each human body in the corresponding scene in the image according to the theoretical depth and the depth of each human body in the corresponding scene in the image; and optimizing the depth confidence of each human body in the corresponding scene in the image according to the theoretical confidence of each human body in the corresponding scene in the image to obtain the depth confidence of each optimized human body in the corresponding scene in the image.

In another embodiment, still referring to fig. 8, the apparatus in this embodiment includes: the first neural network module 600, the first determining module 701, the second determining module 702, the depth optimizing module 703 and the confidence optimizing module 705 further include a second neural network module 700 and a third determining module 704, and the apparatus of the embodiment is used for implementing the method shown in fig. 4. The second neural network module 700 is configured to determine a multi-user multi-node operation matrix according to two-dimensional coordinates of a plurality of nodes on each human body, and obtain a multi-user multi-node eigenvector matrix according to the operation matrix; wherein each node of each human corresponds to one eigenvector in the eigenvector matrix. A third determining module 704, configured to determine a three-dimensional coordinate corresponding to each node of each human body according to the feature vector corresponding to each node of each human body.

Wherein the second neural network module 700 includes: the convolution submodule 7001 and the plurality of circulation submodules 7002, the convolution submodule 7001 may adopt a GCN structure, the circulation submodule 7002 includes two GCN structures connected in series, and each GCN structure may include a graph convolution pooling layer, a regression layer (Batchnorm), a nonlinear layer (Relu layer), a residual network layer and other common network layers of a convolution network.

Another embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed, performs the steps of the method in the above-described embodiments.

Another embodiment of the present invention provides a computer device, which includes a processor, a memory, and a computer program stored in the memory, wherein the processor implements the steps of the method in the above embodiments when executing the computer program.

The above-described aspects may be implemented individually or in various combinations, and such variations are within the scope of the present invention.

It is to be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that an article or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of additional like elements in the article or device comprising the element.

The above embodiments are merely to illustrate the technical solutions of the present invention and not to limit the present invention, and the present invention has been described in detail with reference to the preferred embodiments. It will be understood by those skilled in the art that various modifications and equivalent arrangements may be made without departing from the spirit and scope of the present invention and it should be understood that the present invention is to be covered by the appended claims.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, apparatuses, functional modules/units in the apparatuses disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims

1. A method for estimating the three-dimensional posture of a plurality of people is characterized by comprising the following steps:

2. The method for estimating the three-dimensional pose of multiple persons according to claim 1, wherein said adjusting the depth of multiple persons in the corresponding scene in the image according to the depth confidence comprises:

3. The method of estimating a three-dimensional pose of a plurality of persons according to claim 1, wherein after determining a depth confidence of each person in a corresponding scene in said image, said method further comprises: optimizing the depth confidence:

acquiring the theoretical depth of each human body;

4. The method of claim 3, wherein the adjusting the depths of the plurality of human bodies in the corresponding scene in the image according to the depth confidence comprises:

5. The multi-person three-dimensional pose estimation method according to any one of claims 1 to 4, further comprising:

6. An apparatus for estimating a three-dimensional pose of a plurality of persons, comprising:

the second determining module is used for determining the depth confidence of each human body in the corresponding scene in the image according to the two-dimensional coordinates of the nodes of each human body;

7. The apparatus according to claim 6, wherein the depth optimization module is specifically configured to:

8. The multi-person three-dimensional pose estimation apparatus of claim 6, further comprising a confidence optimization module, said confidence optimization module configured to:

acquiring the theoretical depth of each human body;

9. A computer-readable storage medium having stored thereon a computer program which, when executed, implements the method of estimating a three-dimensional pose of a plurality of persons as recited in any one of claims 1-5.

10. A computer device comprising a processor, a memory and a computer program stored on the memory, the processor implementing the method of estimating a three-dimensional pose of a plurality of persons as claimed in any one of claims 1 to 5 when executing the computer program.