CN110796699B - Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system - Google Patents

Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system

Info

Publication number
CN110796699B
CN110796699B CN201910524334.1A
Authority
CN
China
Prior art keywords
camera
dimensional
view angle
optimal
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910524334.1A
Other languages
Chinese (zh)
Other versions
CN110796699A (en
Inventor
李玉玮
罗曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE
Plex VR Digital Technology Shanghai Co Ltd
Original Assignee
SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE
Plex VR Digital Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE, Plex VR Digital Technology Shanghai Co Ltd filed Critical SHANGHAI CRIMINAL SCIENCE TECHNOLOGY RESEARCH INSTITUTE
Priority to CN201910524334.1A priority Critical patent/CN110796699B/en
Publication of CN110796699A publication Critical patent/CN110796699A/en
Application granted granted Critical
Publication of CN110796699B publication Critical patent/CN110796699B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Processing (AREA)

Abstract

The optimal view angle selection method for a multi-view camera system and the three-dimensional human skeleton detection method comprise the following steps: the multi-view camera system obtains two-dimensional key points, with a key point detection and a confidence assigned to each camera view angle; the confidence is 0 if the key point is not visible in the camera view angle, and 1 if it is visible. An initial optimal camera set is established as initialization, taking the set of camera view angles corresponding to key points whose confidence is greater than or equal to a threshold γ as the initial optimal view angle combination. The three-dimensional human skeleton detection method comprises: acquiring data to obtain the two-dimensional key points and corresponding confidences of human body pictures; selecting optimal camera view angles for the two-dimensional key points; and generating three-dimensional skeleton points. The invention can simply and quickly select the optimal view angle combination from a multi-view camera system for triangulation, thereby obtaining more accurate three-dimensional human skeletons and improving the robustness and accuracy of view angle selection.

Description

Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system
Technical Field
The invention relates to the technical field of three-dimensional graphics, and in particular to an optimal view angle selection method and a three-dimensional human skeleton detection method for a multi-view camera system.
Background
In the field of three-dimensional reconstruction, commonly used input devices are monocular cameras, monocular depth cameras, binocular cameras, and multi-view camera systems. Because the photographed object may be occluded (in particular, the extremities of the human body occlude themselves when a person is photographed), monocular (depth) cameras are limited in view angle range and often cannot perform effective skeleton recognition, and binocular cameras cannot cover the full range of view angles, so a multi-view camera system is most suitable for this task.
However, the number of cameras and their positions in a multi-view camera system are problems that must be considered. Theoretically, the more cameras there are, the more view angles are covered and the better the result should be. But the more cameras there are, the more easily errors arise from camera calibration errors. At present, the conventional approach is to exclude outliers with the Random Sample Consensus (RANSAC) algorithm and then triangulate, but this method does not consider the spatial position information of the cameras, so its effect is poor.
Disclosure of Invention
The invention aims to solve the above problems by providing an optimal view angle selection method and a three-dimensional human skeleton detection method for a multi-view camera system.
In order to achieve the above objective, the present invention provides a method for selecting the optimal view angles of a multi-view camera system in three-dimensional human skeleton detection, wherein the multi-view camera system obtains two-dimensional key points, and each camera view angle v_i corresponds to a key point detection d_i and a confidence s_i. The confidence s_i takes the marking values 0 and 1 for the key point: the value is 0 if the key point is not visible in camera view angle v_i, and 1 if it is visible.
According to the confidence s_i, an initial optimal camera set is established as initialization, and the set of camera view angles corresponding to key points whose confidence s_i is greater than or equal to a threshold γ is taken as the initial optimal view angle combination.
The threshold γ lies in [0.5, 0.9].
Camera view angles whose confidence s_i is 0 are discarded, and cameras with a large average distance to the optimal view angle combination are selected.
The camera view angles are set as nodes and connected pairwise into edges, forming a graph G = ⟨V, ε⟩, and the nodes are marked with 0/1 values by minimizing the following function: E_{i,j} = E_U(v_i → l_i) + λ·E_B(v_i → l_i, v_j → l_j), where V denotes the camera set and ε denotes the set of connection relations.
E_U is a similarity energy function for measuring the suitability of the mark l_i for camera v_i, with i, j ∈ {1, 2, ..., n}.
The binary term E_B defines a constraint on the selected camera positions and view angle directions, as follows:
E_B(v_i, v_j) = d(c_i, c_j) + Θ(c_i, c_j)
where d(c_i, c_j) is the Euclidean distance between camera i and camera j, and Θ(c_i, c_j) is the normalized angle between the viewing (look-at) directions of the two cameras.
The invention also provides a three-dimensional human skeleton detection method, comprising the following steps:
step one, data acquisition: a plurality of human body pictures are acquired with a multi-view camera system;
step two, the two-dimensional key points and corresponding confidences of the human body pictures are obtained;
step three, a view angle graph is established according to the confidences and the camera spatial information, and the optimal camera view angles are selected according to the above optimal view angle selection method;
step four, two-dimensional-to-three-dimensional triangulation and bundle adjustment are performed according to the selected optimal view angle combination, finally generating the three-dimensional skeleton points.
The multi-view camera system comprises a dome, in which 72 cameras distributed in an array are arranged, with the camera view angles pointing toward the center of the dome.
In step two, the two-dimensional key points of the human body on each frame image are obtained through the deep neural network OpenPose, i.e., the two-dimensional skeleton points of the human body are identified from each frame image, comprising 25 points on the body and 21 points on each hand. In step four, given each camera V_i, its visible two-dimensional skeleton point x_i and its camera projection matrix P_i, the following equation is formed:
AX_i = 0
where X_i represents the desired three-dimensional point location.
In step four, the result is further optimized using the bundle adjustment formula:
given the images v_1, v_2, ..., v_n of the optimal view angle set, x_ij denotes the skeleton point with index j detected by the network on v_i, s_ij is the network's confidence in x_ij, and P_vi denotes the projection matrix of v_i.
Compared with the prior art, the invention can simply and quickly select the optimal view angle combination from the multi-view camera system for triangulation, thereby obtaining more accurate three-dimensional human skeletons; it solves the error problem caused by an excessive number of cameras in a conventional multi-view camera system, introduces the spatial position information of the cameras on the basis of the traditional algorithm, and improves the robustness and accuracy of view angle selection.
Drawings
FIG. 1 is a schematic diagram of a multi-view camera system;
FIG. 2a is a schematic diagram of OpenPose detecting two-dimensional skeletal points of a human body;
FIG. 2b is a schematic diagram of the result obtained after triangulating with all view angles, matched against the point cloud;
FIG. 2c is a schematic diagram of the result of triangulation and point cloud matching after view angle selection.
Detailed Description
Referring to FIGS. 1 to 2c, a three-dimensional human skeleton detection method according to an embodiment of the present invention is specifically described below.
Step one, data acquisition. Referring to FIG. 1, an embodiment of the present invention employs a dome multi-view camera system with a diameter of 8 meters and a height of 5 meters, in which 72 cameras are distributed, all directed toward the center of the dome; the specific camera distribution is shown in FIG. 1. In FIG. 1, each white camera icon represents the three-dimensional position of a camera, and each semitransparent rectangle represents the image plane of that camera; the resolution of the cameras adopted in the embodiment is above 2000×2000. This embodiment also uses a checkerboard-textured mannequin to calibrate the multi-view camera system with conventional algorithms.
In the data acquisition process, the captured subject moves freely at the center of the dome while the cameras record dynamic video at a frame rate of 30 frames per second; this embodiment then processes each frame of the dynamic video individually. Step two, acquiring the two-dimensional key points of the human body. The invention uses the existing deep neural network OpenPose to obtain the two-dimensional key points of the human body on the 72 images of each frame. OpenPose is a human pose estimation algorithm based on a convolutional neural network, implemented as an open-source library on the Caffe framework, that can track a person's facial expression, trunk, limbs and even fingers; based on deep learning, it learns a human pose estimation model from a large amount of data. OpenPose is applicable not only to a single person but also to multiple persons, and has good robustness. The model can identify the two-dimensional skeleton points of a human body from a single image, comprising 25 points on the body and 21 points on each hand; in addition, OpenPose provides a floating-point number from 0 to 1 for each identification result, representing the confidence of the two-dimensional skeleton point. Referring to FIG. 2a, the lines in the figure are key point links.
Step three, establishing a view graph and applying the optimal view angle selection method. After the two-dimensional key points of the human body are obtained, the prior art generally triangulates the key point positions in all 72 pictures together with the camera parameters to obtain the three-dimensional key points. In actual operation, however, the detection results of some pictures are very poor and affect the final three-dimensional result. Therefore, a picture view angle selection step is added; its purpose is to exclude view angles with detection errors and to ensure that the selected view angles make the three-dimensional points more accurate after triangulation.
This optimal view selection problem is defined in this embodiment as a 0/1 labeling (binary labeling) problem: given the existing camera set V = {v_1, v_2, ..., v_n}, each camera view angle v_i corresponds to a key point detection d_i and a confidence s_i.
The confidence indicates how confident the network is in the current detection; the higher the value, the more confident the network is in the detected result and the more accurate the result. The confidence is 0 when the key point is not visible in the picture (for example, the picture captures only half of the body) or is not detected due to occlusion. According to the confidence, this embodiment establishes an initial optimal camera set as initialization.
In this embodiment, the set of camera view angles corresponding to key points with confidence greater than or equal to a threshold γ is taken as the initial optimal view angle combination. The threshold γ is generally between 0.5 and 0.9 and can be adjusted according to the number of selected cameras, ensuring that the number of cameras in the optimal view angle combination is at least 10% of the total number of cameras. Camera view angles corresponding to key points with confidence equal to 0 are treated as the initial unselected view angle combination.
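The initialization step above can be sketched in a few lines of Python; the function name and the example confidence values are illustrative, not from the patent:

```python
def init_view_sets(confidences, gamma=0.7):
    """Partition camera view indices by key point confidence s_i:
    confidence >= gamma     -> initial optimal combination,
    confidence == 0         -> initial unselected combination,
    0 < confidence < gamma  -> decided later by the energy minimization."""
    optimal = [i for i, s in enumerate(confidences) if s >= gamma]
    unselected = [i for i, s in enumerate(confidences) if s == 0.0]
    undecided = [i for i, s in enumerate(confidences) if 0.0 < s < gamma]
    return optimal, unselected, undecided

# Illustrative confidences for six camera views
optimal, unselected, undecided = init_view_sets([0.0, 0.95, 0.3, 0.8, 0.0, 0.6])
print(optimal, unselected, undecided)  # [1, 3] [0, 4] [2, 5]
```

In practice γ would be tuned, as the text notes, so that `optimal` retains at least 10% of the cameras.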
Next, this embodiment regards each camera view angle as a node and connects the nodes pairwise into edges; the point set and edge set constitute a graph G = ⟨V, ε⟩, where V denotes the camera set and ε denotes the set of connection relations. By minimizing an energy function, this embodiment marks each node with a 0/1 label l_i ∈ {select (=1), discard (=0)}. For this energy function, a unary term and a binary term are defined respectively:
E_{i,j} = E_U(v_i → l_i) + λ·E_B(v_i → l_i, v_j → l_j)
The unary term E_U is a similarity function for measuring the suitability of the label l_i for camera v_i, where i, j ∈ {1, 2, ..., n}; its values in the three confidence cases are as follows.
According to the above formula, this embodiment forcibly discards camera view angles with confidence 0: the energy of choosing label l_i = 1 is infinite (E_U(v_i = 1) = ∞), while the energy of choosing label l_i = 0 is 0 (E_U(v_i = 0) = 0). Points with confidence greater than or equal to γ are forcibly selected in the same way: the energy of choosing label l_i = 1 is 0 (E_U(v_i = 1) = 0), and the energy of choosing label l_i = 0 is infinite (E_U(v_i = 0) = ∞).
In the formula, one term is the average distance from view angle v_i to the initial optimal view angle combination, and the other is the average distance from v_i to the initial unselected view angle combination. The third case of the formula means that for points with confidence greater than 0 and less than γ, this embodiment selects according to the camera's distance to the known view angle sets: the larger the average distance to the optimal view angle combination, the smaller the energy of selecting the point; and the larger the average distance to the initial unselected view angle combination, the larger the energy of marking the point as 0. These two rules mean that this embodiment tends to select cameras whose average distance to the optimal view angle combination is large; such camera combinations are more spread out in position and can cover the entire scene.
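The behavior of the unary term can be sketched as below. Since the patent's exact formula is not reproduced in this text, the exponential form used for the middle confidence band is an illustrative assumption; it only matches the monotonic behavior described above (larger average distance to the optimal set lowers the cost of selecting, larger average distance to the unselected set raises the cost of discarding):

```python
import math

INF = float("inf")

def mean_dist(i, index_set, centers):
    """Average Euclidean distance from camera i to a set of cameras."""
    if not index_set:
        return 0.0
    return sum(math.dist(centers[i], centers[j]) for j in index_set) / len(index_set)

def unary_energy(i, label, s, centers, optimal, unselected, gamma=0.7):
    """E_U for camera i with label 1 (select) or 0 (discard).
    The exp() shapes for 0 < s_i < gamma are assumptions, not the patent's formula."""
    if s[i] == 0.0:                    # forced discard: selecting costs infinity
        return INF if label == 1 else 0.0
    if s[i] >= gamma:                  # forced select: discarding costs infinity
        return 0.0 if label == 1 else INF
    if label == 1:
        # cheap to select cameras far from the current optimal combination
        return math.exp(-mean_dist(i, optimal, centers))
    # expensive to discard cameras far from the unselected combination
    return 1.0 - math.exp(-mean_dist(i, unselected, centers))
```

With camera 2 far from both initial sets, selecting it is cheap and discarding it is expensive, which is the preference the text describes.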
The binary term E_B is a constraint on the selected camera positions and view angle directions, defined as follows:
E_B(v_i, v_j) = d(c_i, c_j) + Θ(c_i, c_j)
where d(c_i, c_j) is the Euclidean distance between camera i and camera j, and Θ(c_i, c_j) is the normalized angle between the viewing (look-at) directions of the two cameras. In the present invention, the weight λ of the binary term is set to 0.1. When the distance between two cameras is small, d(c_i, c_j) is small; when their view angle directions are close, Θ(c_i, c_j) is small. That is, for a pair of cameras with similar view angles and close positions, E_B(v_i, v_j) is small, and the graph cut algorithm separates them into different labels more easily. In other words, the binary term ensures that cameras that are close together and have similar view angles are given different marks, so that under this constraint the selected cameras see more different angles and acquire more information.
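A direct reading of E_B can be sketched in pure Python; dividing the angle by π is an assumption, since the patent only states that the angle is normalized:

```python
import math

def binary_energy(c_i, c_j, look_i, look_j):
    """E_B(v_i, v_j) = d(c_i, c_j) + Theta(c_i, c_j): Euclidean distance
    between the camera centres plus the angle between the two look-at
    directions, normalized here to [0, 1] by dividing by pi (an assumption)."""
    d = math.dist(c_i, c_j)
    dot = sum(a * b for a, b in zip(look_i, look_j))
    norm_i = math.sqrt(sum(a * a for a in look_i))
    norm_j = math.sqrt(sum(b * b for b in look_j))
    cos_t = max(-1.0, min(1.0, dot / (norm_i * norm_j)))
    return d + math.acos(cos_t) / math.pi
```

A pair of nearby cameras looking the same way thus gets a small E_B, making them cheap to separate into different labels, which is the behavior the text describes.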
This embodiment minimizes the energy function E_{i,j} with a graph cut algorithm. The final result is a camera set satisfying high key point detection confidence, long inter-camera distances, and dissimilar view angles. A camera set meeting these conditions gives more accurate results in triangulation.
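For illustration, the combined labeling energy can be minimized by brute force over a handful of cameras (the patent uses a graph cut solver for the real 72-camera rig; charging the pairwise cost only when the two labels differ follows the usual graph cut convention and is an assumption here):

```python
from itertools import product

def total_energy(labels, unary, edges, lam=0.1):
    """Sum of unary terms plus lambda-weighted binary terms; a binary term
    is charged when the two endpoint cameras receive different labels."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    for (i, j), e_b in edges.items():
        if labels[i] != labels[j]:
            e += lam * e_b
    return e

def min_energy_labeling(unary, edges, lam=0.1):
    """Exhaustive search over 0/1 labelings; exponential cost, demo only."""
    n = len(unary)
    best = min(product((0, 1), repeat=n),
               key=lambda lab: total_energy(lab, unary, edges, lam))
    return list(best)

INF = float("inf")
# camera 0 forced to select, camera 1 forced to discard, camera 2 free;
# unary[i] = [cost of label 0, cost of label 1] (illustrative values)
unary = [[INF, 0.0], [0.0, INF], [0.3, 0.4]]
edges = {(0, 2): 0.1, (1, 2): 5.0}
print(min_energy_labeling(unary, edges))  # [1, 0, 0]
```

Camera 2 is cheap to separate from camera 0 (small E_B) but expensive to separate from camera 1, so it keeps camera 1's label, mirroring the separation behavior of the graph cut.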
Step four, generating three-dimensional skeleton points. According to the above algorithm, this embodiment establishes a view graph for each key point, selects an optimal view angle set, and then uses triangulation and bundle adjustment algorithms to recover more accurate three-dimensional human skeleton points. Triangulation recovers the three-dimensional points from the multi-view cameras as follows: given each camera V_i, its visible two-dimensional skeleton point x_i and camera projection matrix P_i, the following equation is formed, where X_i represents the desired three-dimensional point position:
AX_i = 0
Solving this linear system yields the three-dimensional position X_i of the key point. By triangulating each key point, this embodiment obtains a complete three-dimensional skeleton comprising 25 points on the body and 21 points on each hand. The result is further optimized by bundle adjustment, which uses the OpenPose confidences to make the result more accurate.
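The system AX_i = 0 is the standard direct linear transform (DLT); a minimal NumPy sketch follows, with toy normalized cameras (identity intrinsics are an assumption for the demo, not the patent's calibration):

```python
import numpy as np

def triangulate_dlt(projections, points2d):
    """For each camera, stack the two DLT rows x*P[2]-P[0] and
    y*P[2]-P[1] into A, then take the SVD null vector of A X = 0."""
    rows = []
    for P, (x, y) in zip(projections, points2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.asarray(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]               # dehomogenize

# Two toy cameras: identity intrinsics, second camera shifted 1 unit along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = triangulate_dlt([P1, P2], [(0.2, 0.4), (0.0, 0.4)])
print(X)  # ~ [1. 2. 5.]
```

Both observations are exact projections of the point (1, 2, 5), so the null space of A recovers it up to scale.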
The bundle adjustment formula can be expressed as follows: given the images v_1, v_2, ..., v_n of the optimal view angle set, x_ij denotes the skeleton point with index j detected by the network on v_i, s_ij is the network's confidence in x_ij, and P_vi denotes the projection matrix of v_i. The optimization objective of this embodiment is that all 67 three-dimensional skeleton points X_1, X_2, ..., X_67 (25 on the body and 21 on each hand), after the reprojection transformation P_vi at each view angle, be as close as possible to the network's two-dimensional skeleton points x_ij.
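The confidence-weighted reprojection objective described above can be written, for a single three-dimensional point, as the cost below; minimizing it over X (for instance with scipy.optimize.least_squares) is the bundle adjustment refinement. The function and variable names are illustrative:

```python
import numpy as np

def reprojection_cost(X, projections, points2d, confidences):
    """sum_i s_i * || proj(P_i X) - x_i ||^2 for one 3-D point X,
    where proj() applies the perspective division to the projected point."""
    Xh = np.append(np.asarray(X, dtype=float), 1.0)
    cost = 0.0
    for P, x, s in zip(projections, points2d, confidences):
        p = P @ Xh
        uv = p[:2] / p[2]                       # perspective division
        cost += s * float(np.sum((uv - np.asarray(x)) ** 2))
    return cost

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
obs = [(0.2, 0.4), (0.0, 0.4)]                  # exact projections of (1, 2, 5)
print(reprojection_cost([1.0, 2.0, 5.0], [P1, P2], obs, [0.9, 0.8]))  # ~ 0.0
```

The weight s_ij ensures that low-confidence OpenPose detections pull the optimized point less than high-confidence ones.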
Through the above triangulation and bundle adjustment, this embodiment finally obtains a group of highly accurate three-dimensional human skeleton points.
By performing the above operations on each frame of the dynamic video, this embodiment obtains a set of accurate three-dimensional dynamic skeletons.
The embodiments of the present invention have been described above with reference to the accompanying drawings and examples, which are not to be construed as limiting the invention; those skilled in the art can make modifications as required, all within the scope of the appended claims.

Claims (9)

1. An optimal view angle selection method of a multi-view camera system in three-dimensional human skeleton detection, wherein the multi-view camera system obtains two-dimensional key points, characterized in that:
each camera view angle v_i corresponds to a key point detection d_i and a confidence s_i;
the confidence s_i takes the marking values 0 and 1 for the key point: the value is 0 if the key point is not visible in the camera view angle v_i, and 1 if it is visible;
according to the confidence s_i, an initial optimal camera set is established as initialization, and the set of camera view angles corresponding to key points whose confidence s_i is greater than or equal to a threshold γ is taken as the initial optimal view angle combination; for points whose confidence s_i is greater than 0 and less than γ, cameras with a large average distance to the optimal view angle combination are selected; camera view angles whose confidence s_i is 0 are discarded.
2. The optimal view angle selection method of a multi-view camera system in three-dimensional human skeleton detection according to claim 1, characterized in that: the threshold γ ∈ [0.5, 0.9].
3. The optimal view angle selection method of a multi-view camera system in three-dimensional human skeleton detection according to claim 1 or 2, characterized in that: the camera view angles are set as nodes and connected into edges, forming a graph G = ⟨V, ε⟩, and the nodes are marked with 0/1 values by minimizing the following function: E_{i,j} := E_U(v_i → l_i) + λ·E_B(v_i → l_i, v_j → l_j); where V represents the camera set and ε represents the set of connection relations.
4. The optimal view angle selection method of a multi-view camera system in three-dimensional human skeleton detection according to claim 3, characterized in that: E_U is a similarity energy function for measuring the suitability of the mark l_i for camera v_i, with i, j ∈ {1, 2, ..., n}.
5. The optimal view angle selection method of a multi-view camera system in three-dimensional human skeleton detection according to claim 4, characterized in that: the binary term E_B defines a constraint on the selected camera positions and view angle directions, as follows:
E_B(v_i, v_j) = d(c_i, c_j) + Θ(c_i, c_j)
where d(c_i, c_j) is the Euclidean distance between camera i and camera j, and Θ(c_i, c_j) is the normalized angle between the view angle directions of the two cameras.
6. A three-dimensional human skeleton detection method, characterized in that it comprises:
step one, data acquisition: a plurality of human body pictures are acquired with a multi-view camera system;
step two, the two-dimensional key points and corresponding confidences of the human body pictures are obtained;
step three, a view angle graph is established according to the confidences and the camera spatial information, and the optimal camera view angles are selected according to the optimal view angle selection method of any one of claims 1-5;
step four, two-dimensional-to-three-dimensional triangulation and bundle adjustment are performed according to the selected optimal view angle combination, finally generating the three-dimensional skeleton points.
7. The three-dimensional human skeleton detection method according to claim 6, characterized in that: in step two, the two-dimensional key points of the human body on each frame image are obtained through the deep neural network OpenPose, i.e., the two-dimensional skeleton points of the human body are identified from each frame image, comprising 25 points on the body and 21 points on each hand.
8. The three-dimensional human skeleton detection method according to claim 6, characterized in that: in step four, given each camera V_i, its visible two-dimensional skeleton point x_i and camera projection matrix P_i, the following equation is formed:
AX_i = 0
where X_i represents the desired three-dimensional point location.
9. The three-dimensional human skeleton detection method according to claim 8, characterized in that: in step four, the result is further optimized using the bundle adjustment formula: given the images v_1, v_2, ..., v_n of the optimal view angle set, x_ij denotes the skeleton point with index j detected by the network on v_i, s_ij is the network's confidence in x_ij, and P_vi denotes the projection matrix of v_i.
CN201910524334.1A 2019-06-18 2019-06-18 Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system Active CN110796699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910524334.1A CN110796699B (en) 2019-06-18 2019-06-18 Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910524334.1A CN110796699B (en) 2019-06-18 2019-06-18 Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system

Publications (2)

Publication Number Publication Date
CN110796699A CN110796699A (en) 2020-02-14
CN110796699B true CN110796699B (en) 2024-03-01

Family

ID=69427373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910524334.1A Active CN110796699B (en) 2019-06-18 2019-06-18 Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system

Country Status (1)

Country Link
CN (1) CN110796699B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798486B (en) * 2020-06-16 2022-05-17 浙江大学 Multi-view human motion capture method based on human motion prediction
EP3951715A1 (en) * 2020-08-05 2022-02-09 Canon Kabushiki Kaisha Generation apparatus, generation method, and program
US12002214B1 (en) * 2023-07-03 2024-06-04 MOVRS, Inc. System and method for object processing with multiple camera video data using epipolar-lines

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106056050A (en) * 2016-05-23 2016-10-26 武汉盈力科技有限公司 Multi-view gait identification method based on adaptive three dimensional human motion statistic model
WO2018120964A1 (en) * 2016-12-30 2018-07-05 山东大学 Posture correction method based on depth information and skeleton information
CN109325995A (en) * 2018-09-13 2019-02-12 叠境数字科技(上海)有限公司 Low resolution multi-angle of view hand method for reconstructing based on manpower parameter model

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN106056050A (en) * 2016-05-23 2016-10-26 武汉盈力科技有限公司 Multi-view gait identification method based on adaptive three dimensional human motion statistic model
WO2018120964A1 (en) * 2016-12-30 2018-07-05 山东大学 Posture correction method based on depth information and skeleton information
CN109325995A (en) * 2018-09-13 2019-02-12 叠境数字科技(上海)有限公司 Low resolution multi-angle of view hand method for reconstructing based on manpower parameter model

Non-Patent Citations (2)

Title
Yang Kai; Wei Benzheng; Ren Xiaoqiang; Wang Qingxiang; Liu Huaihui. Human motion pose tracking and recognition algorithm based on depth images. Journal of Data Acquisition and Processing, 2015, (05), full text. *
Lin Rui; Wang Junying; Sun Shuifa; Dong Fangmin. Three-dimensional human body reconstruction based on Kinect skeleton registration. Information & Communications, 2016, (12), full text. *

Also Published As

Publication number Publication date
CN110796699A (en) 2020-02-14

Similar Documents

Publication Publication Date Title
CN108717712B (en) Visual inertial navigation SLAM method based on ground plane hypothesis
CN109360240B (en) Small unmanned aerial vehicle positioning method based on binocular vision
Gupta et al. Texas 3D face recognition database
CN110796699B (en) Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system
CN108629946B (en) Human body falling detection method based on RGBD sensor
KR101791590B1 (en) Object pose recognition apparatus and method using the same
TWI320847B (en) Systems and methods for object dimension estimation
CN111881887A (en) Multi-camera-based motion attitude monitoring and guiding method and device
US20150243035A1 (en) Method and device for determining a transformation between an image coordinate system and an object coordinate system associated with an object of interest
US11100669B1 (en) Multimodal three-dimensional object detection
CN110555408B (en) Single-camera real-time three-dimensional human body posture detection method based on self-adaptive mapping relation
CN110598590A (en) Close interaction human body posture estimation method and device based on multi-view camera
US20190371024A1 (en) Methods and Systems For Exploiting Per-Pixel Motion Conflicts to Extract Primary and Secondary Motions in Augmented Reality Systems
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
CN109977827B (en) Multi-person three-dimensional attitude estimation method using multi-view matching method
CN111881888A (en) Intelligent table control method and device based on attitude identification
CN111998862A (en) Dense binocular SLAM method based on BNN
CN104182968A (en) Method for segmenting fuzzy moving targets by wide-baseline multi-array optical detection system
CN110675436A (en) Laser radar and stereoscopic vision registration method based on 3D feature points
CN112767467A (en) Double-image depth estimation method based on self-supervision deep learning
CN110428461B (en) Monocular SLAM method and device combined with deep learning
CN111797806A (en) Three-dimensional graph convolution behavior identification method based on 2D framework
WO2022018811A1 (en) Three-dimensional posture of subject estimation device, three-dimensional posture estimation method, and program
CN113111743A (en) Personnel distance detection method and device
CN113822174B (en) Sight line estimation method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant