CN111797714B - Multi-view human motion capture method based on key point clustering

Info

Publication number: CN111797714B
Application number: CN202010546913.9A
Authority: CN
Inventors: 周晓巍; 鲍虎军; 帅青; 方琦
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-06-16
Filing date: 2020-06-16
Publication date: 2022-04-26
Anticipated expiration: 2040-06-16
Also published as: CN111797714A

Abstract

The invention discloses a multi-view human motion capture method based on key point clustering. The method comprises: detecting human body key points in multi-view images; building a relation graph of the key points from image information and geometric relations; associating the key points across viewpoints, using the relation graph and a clustering algorithm, to find the key points that belong to the same person; and recovering the three-dimensional positions of the human body key points from the association results. By establishing the association of human body key points across images from different viewpoints with an efficient and robust key point clustering method, the invention improves the efficiency and accuracy of multi-view human motion capture.

Description

Multi-view human motion capture method based on key point clustering

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a multi-view human motion capture method based on key point clustering.

Background

In related markerless multi-person reconstruction from multi-view video, some methods first detect each human body with a top-down detector and then match the detected bodies across viewpoints; these methods perform poorly when people in the scene are occluded. More recently, another approach associates individual bone segments of the human body across views, reconstructs each bone segment, and then combines the segments belonging to the same person. This approach does not make sufficient use of the geometric information of the human body, which makes the association process less efficient.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a multi-view human motion capture method based on key point clustering, which performs three-dimensional human body reconstruction using the clustering results.

According to a first aspect of the present invention, there is provided a method for establishing a graph model for human key points, comprising:

1. Establishing a graph model: the invention models the human body key point clustering problem with a graph. The human body key points detected under each viewpoint are regarded as vertices of the graph, and the connections between key points are regarded as its edges. One human body can be represented by a subgraph, and one subgraph comprises a plurality of vertices. A subgraph represents a complete person if and only if it contains all vertices belonging to that person, i.e. all key points of that person detected under the respective viewpoints. The human body key point clustering problem can therefore be described as finding a complete human body subgraph in the graph, and in a multi-person scene as finding all such subgraphs.

2. Construction of graph model vertices: given the key point response heat maps generated from the multi-view images, candidate positions of each type of human body key point are obtained on each image by non-maximum suppression. All key points obtained under all viewpoints are taken as the vertices of the graph model.

3. Construction of graph model edges: given the vertices of the graph model, image information and geometric information are used together to construct weighted edges between vertices, where the weight of an edge represents the distance between the two vertices it connects. For any two vertices, the closer the distance is to zero, the more likely the two vertices belong to the same person; the larger the distance, the less likely. For two key points under the same viewpoint, the distance is infinite if they are not adjacent on the human skeleton tree; if they are adjacent, the distance is derived from the image confidence of the connection between the two key points. For two key points under different viewpoints, the distance is infinite if they represent different joint types; if they represent the same joint type, the distance is obtained from spatial geometry as follows: the two two-dimensional key points are back-projected into space to form two rays, and the shortest distance between the two rays is calculated.
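The cross-view distance in item 3 can be illustrated with a minimal Python sketch. It assumes each back-projected key point is already available as a ray origin (the camera centre) and a direction in world coordinates, and it treats the rays as infinite lines, which is a simplification. The function name `line_distance` and the helper names are hypothetical and do not come from the patent.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def line_distance(o1, d1, o2, d2, eps=1e-12):
    """Shortest distance between the lines o1 + s*d1 and o2 + t*d2.

    Assumption: rays are treated as infinite lines (s and t unbounded).
    """
    n = cross(d1, d2)          # direction perpendicular to both lines
    w = sub(o2, o1)            # vector between the two origins
    n_len = norm(n)
    if n_len < eps * norm(d1) * norm(d2):
        # Parallel lines: distance from o2 to the first line.
        return norm(cross(w, d1)) / norm(d1)
    # Skew or intersecting lines: project w onto the common perpendicular.
    return abs(dot(w, n)) / n_len

# Two perpendicular lines separated by one unit along the z axis.
skew = line_distance((0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 0))
# Two parallel lines two units apart.
parallel = line_distance((0, 0, 0), (1, 0, 0), (0, 2, 0), (2, 0, 0))
```

A small distance suggests the two detections may be views of the same joint; how the rays are obtained from camera calibration is outside this sketch.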

According to a second aspect of the present invention, there is provided a clustering method for graph models of human body key points, comprising:

1. Vertex association: given an initialized graph model, clustering is performed on the human body key point graph described above. Given the vertices and edges of the graph, all edges are sorted by distance in ascending order. Following this order, starting from the edge with the smallest distance, edges whose two vertices are not yet connected are selected in turn, and the two vertices joined by each selected edge are merged into a subgraph, until the distance of the edge exceeds a given maximum distance threshold.

2. Subgraph association: after vertex association, some vertices have been connected into subgraphs containing multiple vertices. Unlike general graph clustering methods, the clustering process uses distances between subgraphs rather than only distances between vertices. The vertices in the first step can be regarded as subgraphs containing a single vertex. The distance between two subgraphs is calculated as the mean of all distances between the vertices contained in the two subgraphs. After the distances are calculated, all edges are sorted by distance in ascending order and selected in turn for connection.

3. Iterative updating: the clustering of the whole graph thus alternates between two processes, distance calculation and subgraph association. After the distances are calculated, subgraphs are connected, and the distances of the connected subgraphs are recalculated. Because more key points contribute to each distance as the iteration proceeds, the clustering is more robust than connection based only on distances between individual key points. The iteration ends when no new edge can be connected.
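The subgraph distance used in items 2 and 3 can be sketched as follows. The text defines it as the mean of the vertex-to-vertex distances; this sketch additionally assumes that pairs with an infinite distance (no edge) are left out of the mean, which the text does not state. The name `subgraph_distance` and the toy distance table are hypothetical.

```python
import math

def subgraph_distance(c1, c2, vertex_distance):
    """Mean vertex-to-vertex distance between two subgraphs.

    Assumption: pairs whose distance is infinite are skipped; if no
    finite pair exists, the subgraphs are treated as infinitely far apart.
    """
    finite = []
    for x in c1:
        for y in c2:
            d = vertex_distance(x, y)
            if math.isfinite(d):
                finite.append(d)
    return sum(finite) / len(finite) if finite else math.inf

# Hypothetical toy distances between labelled vertices.
toy = {("a", "c"): 1.0, ("b", "c"): 3.0,
       ("a", "d"): math.inf, ("b", "d"): math.inf}

def lookup(x, y):
    return toy[(x, y)]

mean_ab_c = subgraph_distance(["a", "b"], ["c"], lookup)
no_edge = subgraph_distance(["a"], ["d"], lookup)
```

Averaging over several vertex pairs is what makes the subgraph distance less sensitive to a single noisy detection than a vertex-to-vertex distance.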

According to a third aspect of the present invention, there is provided a reconstruction tracking method, comprising:

1. Reconstructing the clustering result: for each clustered human body, the key points observed from two or more viewpoints are reconstructed to obtain the three-dimensional key point positions.

2. Tracking the reconstruction result: the reconstruction result of the previous frame is regarded as a subgraph containing only three-dimensional key points and is added to the subgraph set of the graph model of the current frame. The distance between a subgraph of three-dimensional key points and a subgraph of two-dimensional key points of the current frame is calculated as the distance from each three-dimensional key point to the straight line formed in space by back-projecting the two-dimensional key point. The subgraph formed by the three-dimensional key points can then be connected with other subgraphs to complete the clustering process.
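The distance used for tracking in item 2 is a point-to-line distance. A minimal Python sketch, assuming the back-projected two-dimensional key point is given as a line origin and direction in world coordinates (the function name `point_to_line_distance` is hypothetical):

```python
import math

def point_to_line_distance(p, o, d):
    """Distance from 3D point p to the line o + t*d.

    Uses |(p - o) x d| / |d|; the line is treated as infinite,
    which is an assumption of this sketch.
    """
    w = (p[0] - o[0], p[1] - o[1], p[2] - o[2])
    c = (w[1] * d[2] - w[2] * d[1],
         w[2] * d[0] - w[0] * d[2],
         w[0] * d[1] - w[1] * d[0])
    return math.sqrt(sum(v * v for v in c)) / math.sqrt(sum(v * v for v in d))

# A point three units away from the x axis.
dist = point_to_line_distance((0.0, 3.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

A previous-frame key point that lies close to the line of a current detection is a plausible match for the same person and joint.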

The invention has the beneficial effects that: the invention carries out the reconstruction of the three-dimensional posture of the human body by correlating key points of the human body in the multi-viewpoint images. In order to realize the association of key points, the invention provides a method for establishing points and edges of a graph model and simultaneously provides an iterative clustering method combining image information, geometric information and time sequence information. The method realizes multi-viewpoint human body motion capture and improves the accuracy of the image-based motion capture method. Compared with the existing method based on key point clustering, the method simultaneously completes two processes of key point association and human body structure organization, and considers human body structure information in the clustering process, so that the clustering result is more robust.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a schematic view of a multi-view human motion capture process according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of the vertices and edges of the initialized graph model according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a clustering result in an iterative process according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

As shown in fig. 1, the present invention provides a multi-view human motion capture method based on key point clustering, which performs three-dimensional key point reconstruction based on key point positions obtained from images. The method comprises the following specific steps:

1. Images synchronously acquired by a plurality of cameras with different viewpoints are taken as input, and a heat map of key point positions under each viewpoint is obtained with an existing deep neural network. From these heat maps, the two-dimensional positions of candidate human body key points in each image are extracted by non-maximum suppression. For the c-th camera and the j-th joint of the human body, the two-dimensional position of the m-th key point detected in the image is denoted

$$d_{cjm} \in \mathbb{R}^2,$$

where $j = 1, 2, \dots, J$ indexes the $J$ different joints of the human body and $c = 1, 2, \dots, V$ indexes the $V$ cameras, as shown in Fig. 1(a).
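Non-maximum suppression on a heat map, as used in step 1, can be sketched in a few lines of Python. The heat map is assumed to be a 2D list of scores for one joint type in one view; the 3x3 neighbourhood, the score threshold, and the name `heatmap_peaks` are assumptions of this sketch rather than details given in the text.

```python
def heatmap_peaks(heatmap, threshold=0.1):
    """Return (row, col, score) for every strict local maximum.

    A cell is kept if its score is at least `threshold` and strictly
    greater than all of its neighbours in a 3x3 window.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v > n for n in neighbours):
                peaks.append((r, c, v))
    return peaks

# Toy heat map with two separated responses.
toy_heatmap = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
peaks = heatmap_peaks(toy_heatmap)
```

Each returned peak plays the role of one candidate detection of that joint type in that view.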

2. For the key points detected from the images, the distance between the m-th detection $d_{uim}$ of the i-th joint under the u-th viewpoint and the n-th detection $d_{vjn}$ of the j-th joint under the v-th viewpoint is calculated, and a graph model of the human body key points is established. Each vertex of the graph model represents a key point, and each edge carries the key point distance defined by the method. Because the human skeleton has a specific connection pattern (for example, the left wrist is connected only to the left elbow), only joint type combinations that have a connection relationship need to be considered. Let $P$ denote the set of all possible connected joint type combinations; for key points under the same viewpoint, the joint types of two key points that may be connected must form such a combination, i.e. $(i, j) \in P$. The cases in which a distance needs to be calculated are then considered separately:

If $u = v$ and $(i, j) \in P$, the two key points come from the same viewpoint and may lie on the same bone segment. Using an existing deep neural network, the confidence of the connection between the two key points can be obtained from the image, and the distance between the two nodes is defined as the inverse of the confidence. The greater the confidence, the smaller the distance between the two nodes, and the more readily the two nodes are connected. The upper left of Fig. 2 depicts the two-dimensional key point combinations connected along the human spine segment.

If $u \neq v$ and $i = j$, the two key points have the same joint type under different viewpoints, and their distance is defined as the shortest distance between the two rays formed in space by back-projecting the two key points. In the upper-left and lower-left images of Fig. 2, taking the neck joint as an example, the edges connecting the neck joint detections of the upper and lower viewpoints are drawn.

3. The key points of multiple people under multiple viewpoints are clustered with the clustering algorithm provided by the invention. Each clustering result is a subgraph containing at least one key point; the i-th subgraph is denoted $C_i = \{d_{cjm}, \dots\}$, where $i$ is the index of the person corresponding to the subgraph. The result of the t-th step of the iteration is the current subgraph set, denoted $S^t$.

3.0. Initializing clusters: the graph model construction method provided by the invention is used to obtain the nodes and edges of the initial graph. Each node is taken as a subgraph, giving an initial subgraph set $S^0 = \{C_1, C_2, \dots, C_N\}$ that contains $N$ subgraphs in total. Each key point appears in only one subgraph.

3.1. Distance calculation: in the t-th iteration, the distance between every two subgraphs in the subgraph set $S^t$ is calculated. For subgraphs $C_1 = \{d_{u_x i_x m_x}\}$ and $C_2 = \{d_{v_y j_y n_y}\}$, the distance is the mean of the distances between their nodes:

$$D_s(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x=1}^{|C_1|} \sum_{y=1}^{|C_2|} D\!\left(d_{u_x i_x m_x},\, d_{v_y j_y n_y}\right),$$

where $d_{u_x i_x m_x}$ denotes the x-th node of $C_1$, which is the $m_x$-th detection of the $i_x$-th joint under the $u_x$-th viewpoint; $d_{v_y j_y n_y}$ denotes the y-th node of $C_2$, which is the $n_y$-th detection of the $j_y$-th joint under the $v_y$-th viewpoint; $D(\cdot)$ denotes the key point distance defined above; and $D_s(\cdot)$ denotes the distance function between subgraphs.

3.2. Subgraph connection: for the subgraph set $S^t$ of the t-th step, the distances between every two subgraphs form a list $D^t = \{D_s(C_1, C_2), D_s(C_1, C_3), \dots\}$, which is sorted. The subgraph connection process is as follows:

3.2.0: Initialize an empty set $S^{t+1} = \emptyset$, representing the new subgraph set, and an empty set $Q = \emptyset$, representing the set of used subgraphs.

3.2.1: Select the edge $D_s(C_i, C_j)$ with the minimum distance in $D^t$ and remove it from $D^t$.

3.2.2: Check whether $C_i$ or $C_j$ is in the set $Q$. If either is, return to step 3.2.1; if neither is, add $C_i$ and $C_j$ to $Q$.

3.2.3: Check whether the distance $D_s(C_i, C_j)$ is greater than a given maximum distance threshold $D_{max}$. If so, end the connection process.

3.2.4: Merge subgraphs $C_i$ and $C_j$ into a new subgraph $C_k$ and add it to the set $S^{t+1}$. Return to step 3.2.1 and continue connecting.

The subgraph connection process is repeated until the set $D^t$ contains no more edges or no new edge can be connected. After the connection process ends, the newly connected subgraph set $S^{t+1}$ is taken as the new subgraph set, and the process jumps back to step 3.1 to recalculate the distances.

Fig. 3 shows two subgraphs of one of the steps in the iterative process. The keypoints detected in the upper two views are contained in one sub-graph and the keypoints in the lower two views are contained in the other sub-graph. The distance between the two sub-graphs is used for connection, and more key points are considered, so that the distance calculation is more stable, and the robustness of the clustering process can be effectively improved.

The subgraph connection process is repeated until no more new edges can be connected. The clustering result is shown in FIG. 1 (b).
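The connection pass of step 3.2 and the surrounding iteration can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: subgraphs are modelled as frozensets of vertex identifiers, the subgraph distance is passed in as a callable, and subgraphs that are not merged in a pass are carried over unchanged to the next subgraph set, which the text does not spell out. To make that carry-over possible, the threshold is checked before a pair is marked as used. The names `connect_once`, `cluster` and `mean_abs` are hypothetical.

```python
from itertools import combinations

def connect_once(subgraphs, distance, d_max):
    """One pass over the sorted distance list (steps 3.2.0 to 3.2.4)."""
    pairs = sorted(
        ((distance(subgraphs[a], subgraphs[b]), a, b)
         for a, b in combinations(range(len(subgraphs)), 2)),
        key=lambda p: p[0],
    )
    used = set()      # indices of subgraphs already merged (set Q)
    merged = []       # newly formed subgraphs
    for dist, a, b in pairs:              # 3.2.1: next smallest edge
        if a in used or b in used:        # 3.2.2: skip used subgraphs
            continue
        if dist > d_max:                  # 3.2.3: stop at the threshold
            break
        used.update((a, b))
        merged.append(subgraphs[a] | subgraphs[b])   # 3.2.4: merge
    # Assumption: untouched subgraphs stay in the next subgraph set.
    rest = [s for k, s in enumerate(subgraphs) if k not in used]
    return merged + rest, bool(merged)

def cluster(vertices, distance, d_max):
    """Alternate distance calculation and connection until nothing merges."""
    subgraphs = [frozenset([v]) for v in vertices]   # step 3.0
    while True:
        subgraphs, changed = connect_once(subgraphs, distance, d_max)
        if not changed:
            return subgraphs

# Toy example: vertices are integers, distance is the mean absolute difference.
def mean_abs(c1, c2):
    return sum(abs(x - y) for x in c1 for y in c2) / (len(c1) * len(c2))

result = cluster([0, 1, 2, 50, 51], mean_abs, d_max=10)
groups = sorted(sorted(s) for s in result)
```

In the toy run, the first pass merges 0 with 1 and 50 with 51, the second pass attaches 2 to the subgraph containing 0 and 1, and the third pass finds only distances above the threshold and stops.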

4. For the key points completing clustering, a key point reconstruction algorithm is used to obtain the positions of the three-dimensional human body key points, and the reconstruction result is shown in fig. 3. The reconstructed three-dimensional human key points are re-projected into the images of the respective viewpoints, as shown in fig. 1 (c).
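Step 4 does not name a particular key point reconstruction algorithm. One common option, shown here only as an illustrative stand-in, is to take the three-dimensional point that minimises the sum of squared distances to the back-projected lines of all clustered detections of a joint. The sketch assumes each detection is already available as a line origin and direction in world coordinates; the names `closest_point_to_rays` and `det3` are hypothetical.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def closest_point_to_rays(rays):
    """Least-squares point nearest to a set of lines.

    rays: list of (origin, direction) tuples. Solves
    sum(I - u u^T) p = sum((I - u u^T) o) with u the unit direction.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in rays:
        n = math.sqrt(sum(x * x for x in d))
        u = [x / n for x in d]
        for r in range(3):
            for c in range(3):
                m = (1.0 if r == c else 0.0) - u[r] * u[c]
                A[r][c] += m
                b[r] += m * o[c]
    det = det3(A)
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel; point is not determined")
    point = []
    for k in range(3):                     # Cramer's rule, column by column
        Ak = [row[:] for row in A]
        for r in range(3):
            Ak[r][k] = b[r]
        point.append(det3(Ak) / det)
    return tuple(point)

# Two lines that both pass through the point (1, 2, 3).
joint = closest_point_to_rays([
    ((0.0, 0.0, 0.0), (1.0, 2.0, 3.0)),
    ((1.0, 0.0, 0.0), (0.0, 2.0, 3.0)),
])
```

At least two non-parallel lines are needed, which matches the requirement that a clustered human body be observed from two or more viewpoints.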

5. The clustered subgraphs of the previous frame include the three-dimensional human key point positions reconstructed from the previous frame. To make use of temporal information, when the subgraph set of the previous frame is non-empty, the method adds the subgraphs of the previous frame to the subgraph set $S^0$ initialized for the current frame in step 3.0. In the distance calculation of step 3.1, if one subgraph contains three-dimensional key points and the other contains two-dimensional key points, the distance is calculated as the distance from each three-dimensional key point to the ray formed by back-projecting the two-dimensional key point. The rest of the clustering process is unchanged.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points.

The foregoing is only a preferred embodiment of the present invention, and although the present invention has been disclosed in the preferred embodiments, it is not intended to limit the present invention. Those skilled in the art can make numerous possible variations and modifications to the present teachings, or modify equivalent embodiments to equivalent variations, without departing from the scope of the present teachings, using the methods and techniques disclosed above. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present invention are still within the scope of the protection of the technical solution of the present invention, unless the contents of the technical solution of the present invention are departed.

Claims

1. A multi-view human motion capture method based on key point clustering is characterized by comprising the following steps:

S1: detecting two-dimensional positions of human body key points of each joint type in the multi-viewpoint images;

S2: taking all key points obtained under all viewpoints as vertices of a graph model, and constructing weighted edges between the vertices by combining image information and geometric information, wherein the weight of an edge represents the distance between the two vertices connected by the edge, thereby establishing the graph model;

for two key points under the same viewpoint, if the two key points are not adjacent on the human skeleton tree, the distance is infinite, and if the two key points are adjacent on the human skeleton tree, the distance is defined as the opposite number of the confidence of the connection of the two key points in the image;

for two key points under different viewpoints, if the two key points represent joints of different types, the distance between the two key points is infinite, and if the two key points represent joints of the same type, the distance between the two key points is defined as the shortest distance between the two rays formed in space by back-projecting the two key points;

S3: clustering the key points of a plurality of people under multiple viewpoints, wherein each clustering result is a subgraph comprising at least one key point, and the i-th subgraph is denoted $C_i$; the result of the t-th step of the iterative process is the current subgraph set, denoted $S^t$;

3.0. initializing clusters: obtaining nodes and edges of an initial graph; taking each node as a subgraph to obtain an initial subgraph set $S^0 = \{C_1, C_2, \dots, C_N\}$, where $N$ is the total number of subgraphs in the set;

3.1. distance calculation: in the t-th iteration, calculating the distance between every two subgraphs in the subgraph set $S^t$; for subgraphs $C_1 = \{d_{u_x i_x m_x}\}$ and $C_2 = \{d_{v_y j_y n_y}\}$, the distance is the mean of the distances between their nodes:

$$D_s(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x=1}^{|C_1|} \sum_{y=1}^{|C_2|} D\!\left(d_{u_x i_x m_x},\, d_{v_y j_y n_y}\right),$$

wherein $d_{u_x i_x m_x}$ denotes the x-th node of $C_1$, which is the $m_x$-th detection of the $i_x$-th joint under the $u_x$-th viewpoint; $d_{v_y j_y n_y}$ denotes the y-th node of $C_2$, which is the $n_y$-th detection of the $j_y$-th joint under the $v_y$-th viewpoint; $D(\cdot)$ denotes the key point distance, and $D_s(\cdot)$ denotes the distance function between subgraphs;

3.2. subgraph connection: for the subgraph set $S^t$ of the t-th step, the distances between every two subgraphs form a list $D^t = \{D_s(C_1, C_2), D_s(C_1, C_3), \dots\}$, and the list is sorted; the subgraph connection process is as follows:

3.2.0: initializing an empty set $S^{t+1} = \emptyset$, representing the new subgraph set; initializing an empty set $Q = \emptyset$, representing the set of used subgraphs;

3.2.1: selecting the edge $D_s(C_i, C_j)$ with the minimum distance in $D^t$, and removing the edge from $D^t$;

3.2.2: checking whether $C_i$ or $C_j$ is in the set $Q$; if either is, returning to step 3.2.1; if neither is, adding $C_i$ and $C_j$ to the set $Q$;

3.2.3: checking whether the distance $D_s(C_i, C_j)$ is greater than a given maximum distance threshold $D_{max}$, and if so, ending the connection process;

3.2.4: merging subgraphs $C_i$ and $C_j$ into a new subgraph $C_k$, and adding the new subgraph to the set $S^{t+1}$; returning to step 3.2.1 and continuing to connect;

the subgraph connection process is repeated until the set $D^t$ contains no more edges or no new edge can be connected; after the connection process ends, the newly connected subgraph set $S^{t+1}$ is taken as the new subgraph set, and the process jumps to step 3.1 to calculate the distances again;

S4: for the clustered key points, obtaining the positions of the three-dimensional human body key points by using a key point reconstruction algorithm, and re-projecting the reconstructed three-dimensional human body key points into the images of all the viewpoints.

2. The multi-view human motion capture method based on key point clustering of claim 1, wherein in step S3, the clustered subgraphs of the previous frame include the three-dimensional human key point positions reconstructed from the previous frame; when the subgraph set of the previous frame is non-empty, in step 3.0 the subgraphs of the previous frame are added to the subgraph set $S^0$ initialized for the current frame; in step 3.1, if one subgraph contains three-dimensional key points and the other contains two-dimensional key points, the distance is calculated as the distance from the three-dimensional key points to the rays formed by back-projecting the two-dimensional key points.