CN110910449A - Method and system for recognizing three-dimensional position of object - Google Patents

Method and system for recognizing three-dimensional position of object Download PDF

Info

Publication number
CN110910449A
CN110910449A
Authority
CN
China
Prior art keywords
neural network
dimensional
videos
dimensional position
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911223409.9A
Other languages
Chinese (zh)
Other versions
CN110910449B (en)
Inventor
陈健生
薛有泽
万纬韬
张馨予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911223409.9A priority Critical patent/CN110910449B/en
Publication of CN110910449A publication Critical patent/CN110910449A/en
Application granted granted Critical
Publication of CN110910449B publication Critical patent/CN110910449B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The invention provides a method and a system for identifying the three-dimensional position of an object, wherein the method comprises the following steps: acquiring a plurality of videos respectively shot by a plurality of camera devices on the same object; determining two-dimensional positions of key points of the object in the plurality of videos respectively; predicting the three-dimensional position of the key point according to the two-dimensional position by utilizing a neural network; determining the projection positions of the key points in the imaging surfaces of the plurality of camera devices according to the three-dimensional positions and the parameters of the plurality of camera devices; calculating a loss function of the neural network according to the difference between the projection position and the two-dimensional position, and optimizing parameters of the neural network according to the loss function.

Description

Method and system for recognizing three-dimensional position of object
Technical Field
The invention relates to the field of image recognition, in particular to a method and a system for recognizing a three-dimensional position of an object.
Background
Neural networks have been used to estimate three-dimensional positions from two-dimensional images of objects; existing algorithms directly infer three-dimensional coordinates using the two-dimensional keypoint coordinates of a single view as input. Tests of these existing neural network estimation algorithms on several videos show that their generalization capability is poor.
There are two main reasons for the poor generalization capability of the prior art. First, a single view angle cannot provide enough three-dimensional information: the three-dimensional structure inferred by the neural network depends on the statistical characteristics of the training data and cannot be transferred correctly to new scenes and different camera configurations. Second, the actual use environment differs greatly from commonly used public data sets such as Human3.6M, so a model trained on such data sets cannot generalize to the actual application scenario.
Disclosure of Invention
In view of the above, the present invention provides a method for recognizing a three-dimensional position of an object, including:
acquiring a plurality of videos respectively shot by a plurality of camera devices on the same object;
determining two-dimensional positions of key points of the object in the plurality of videos respectively;
predicting the three-dimensional position of the key point according to the two-dimensional position by utilizing a neural network;
determining the projection positions of the key points in the imaging surfaces of the plurality of camera devices according to the three-dimensional positions and the parameters of the plurality of camera devices;
calculating a loss function of the neural network according to the difference between the projection position and the two-dimensional position, and optimizing parameters of the neural network according to the loss function.
Optionally, in the step of predicting the three-dimensional position of the keypoint from the two-dimensional position by using a neural network, the two-dimensional position of the keypoint in one of the videos is used as input data of the neural network, so that the neural network outputs the three-dimensional position.
Optionally, the plurality of videos are videos captured by an odd number of cameras that are closely spaced horizontally, and the input data is taken from a video captured by a camera that is centered horizontally.
Optionally, the determining the two-dimensional positions of the key points of the object in the plurality of videos respectively comprises:
respectively determining the areas of the objects in the plurality of videos by using the trained object detection network;
and respectively determining the two-dimensional positions of the key points in the region by utilizing the trained key point detection network.
Optionally, before acquiring a plurality of videos respectively captured by a plurality of image capturing apparatuses for the same object, the method further includes: initializing the parameters of the neural network using training data, wherein the training data are a plurality of videos of the same object shot by a plurality of camera devices, and the videos include the process of the object moving away from or toward the camera devices.
Optionally, the initialization is divided into two phases, the loss functions used in the two phases being different.
Optionally, the loss function used in the first stage updates the parameters of the neural network with a first optimization goal, the first optimization goal being to make the depth coordinates of the three-dimensional positions of the object key points in the training data output by the neural network positive;
the loss function used in the second stage updates the parameters of the neural network with a second optimization objective that reconciles the projected locations and the two-dimensional locations of the object keypoints in the training data on the basis of the first optimization objective.
Optionally, the neural network is a long-short term memory network.
Optionally, the object is a human body, and the key points include a plurality of parts of the human body.
The present invention also provides a system for identifying a three-dimensional position of an object, comprising:
a plurality of image pickup devices for respectively picking up videos of the same object;
and the terminal is used for identifying the three-dimensional position of the object according to the method for identifying the three-dimensional position of the object.
The method for identifying the three-dimensional position of an object combines a data-driven neural network with traditional, manually modeled optimization: a neural network converts the sequence of two-dimensional keypoint coordinates into a sequence of three-dimensional coordinates, so the optimization problem over the three-dimensional keypoint coordinates becomes an optimization of the neural network's parameters, which constrains the temporal relationship of the coordinates better than optimizing the three-dimensional coordinates directly. In addition, the invention estimates the three-dimensional position by optimization rather than direct inference, makes full use of the video information captured from multiple view angles, and imposes explicit geometric constraints on the three-dimensional keypoint coordinates, so the identification process is more efficient, the identification result is more accurate, and the weak generalization capability common in the prior art is overcome.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method of identifying a three-dimensional position of an object in an embodiment of the invention;
FIGS. 2 and 3 are schematic views of a scene of a system for identifying a three-dimensional position of an object according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a process of recognizing a three-dimensional position of a human body in an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a method for identifying the three-dimensional position of an object, which can be executed by electronic equipment such as a computer, a server and the like, and as shown in figure 1, the method comprises the following steps:
s1, a plurality of images obtained by the plurality of imaging devices respectively capturing the same object are acquired. The number of the plurality of cameras may be 2, 3 or more, and in an actual usage scenario, the heights of the cameras should be substantially the same, have a certain interval in the horizontal direction, and all face towards the object to be photographed.
These cameras capture a single object simultaneously to obtain i views of the video.
S2, determining the two-dimensional positions of the keypoints of the object in the plurality of videos respectively. A video is a sequence of images (frames); taking the frames at time t as an example, time t corresponds to i images, and the obtained two-dimensional position can be written as x^i_{t,m}, meaning the two-dimensional position of keypoint m at time t in the image of view angle i. Many methods exist for determining the two-dimensional coordinates of a point in a two-dimensional image, and any existing technique may be used for this operation.
S3, predicting the three-dimensional position of the keypoint from the two-dimensional position using the neural network. With its current (or initialized) parameters, the neural network takes as input the two-dimensional positions x^i_{t,m} from one, several, or all of the i images at time t and outputs a three-dimensional coordinate, namely the three-dimensional position of keypoint m at time t, written X_{t,m}.
S4, determining the projection positions of the keypoints in the imaging plane of each camera according to the three-dimensional positions and the parameters of the plurality of cameras. The camera parameters can be calibrated when the hardware environment is set up; they comprise intrinsic and extrinsic parameters. First, the intrinsic parameters of the three cameras can be calibrated with black-and-white checkerboard pictures, using only the camera calibration toolbox of MATLAB. The extrinsic parameters are then calibrated with COLMAP: a group of pictures shot by the cameras at the same moment is selected, and the extrinsic parameters are obtained while performing sparse reconstruction with COLMAP's sparse reconstruction function. The calibrated cameras are not moved afterwards, so they are not re-calibrated when videos are shot; the parameters calibrated when the environment was built are used throughout.
The three-dimensional coordinates are projected to each view angle using the pre-calibrated parameters, giving the projection coordinates x̂^i_{t,m}, i.e. the projection position in view angle i of the three-dimensional coordinate of keypoint m at time t.
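Under a standard pinhole model, this projection step can be written down concretely. The following is a minimal sketch, assuming intrinsics K and extrinsics (R, t) per camera as produced by the calibration described above; the function name and array shapes are illustrative, not taken from the patent.

```python
import numpy as np

def project_to_view(X_world, K, R, t):
    """Project 3D keypoints (M, 3) in world coordinates onto one camera's image plane.

    K: 3x3 intrinsic matrix, R: 3x3 rotation, t: (3,) translation (extrinsics),
    e.g. as obtained from the MATLAB toolbox / COLMAP calibration.
    Returns pixel coordinates of shape (M, 2).
    """
    X_cam = X_world @ R.T + t          # world -> camera coordinates
    x = X_cam @ K.T                    # camera -> homogeneous pixel coordinates
    return x[:, :2] / x[:, 2:3]        # perspective division

# Example usage: projections[i] corresponds to x̂^i_{t,m} in the text.
# cameras = [(K1, R1, t1), (K2, R2, t2), (K3, R3, t3)]
# projections = [project_to_view(X_t, K, R, t) for K, R, t in cameras]
```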
S5, calculating a loss function of the neural network from the difference between the projection positions and the two-dimensional positions, and optimizing the parameters of the neural network according to the loss function. The three-dimensional coordinate of keypoint m should be consistent across the view angles, i.e. after projection it should coincide with the two-dimensional keypoint coordinate of each view angle: x̂^i_{t,m} should agree (be equal or approximately equal) with x^i_{t,m}. The optimization target can therefore be defined from the difference between the two, giving the loss function

L = Σ_i Σ_m ‖ x̂^i_{t,m} − x^i_{t,m} ‖,

which should be as small as possible.
For example, for the images at time t, the three-dimensional positions of the keypoints at time t are estimated, the loss function is computed, and the parameters of the neural network are optimized; the optimized parameters are then used when the neural network estimates the three-dimensional positions of the keypoints at time t+1. The optimization target L defined above can be optimized by gradient descent, and the parameters of the neural network are updated continuously during the optimization until the three-dimensional keypoints estimated by the neural network are consistent with the two-dimensional keypoints of every view angle. In this way the structure of the neural network is combined with the optimization idea, and three-dimensional keypoint coordinates with view-angle consistency are obtained indirectly by optimizing the parameters of the neural network.
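A minimal PyTorch sketch of this per-frame scheme is given below. It assumes a differentiable reprojection and plain gradient descent; the names net, cameras and two_d_sequence, and the use of SGD, are illustrative assumptions and not taken from the patent.

```python
import torch

def reprojection_loss(X_3d, keypoints_2d, cameras):
    """L: sum over views i and keypoints m of the distance between the projection
    of the predicted 3D keypoints and the detected 2D keypoints of that view."""
    loss = 0.0
    for (K, R, t), x_2d in zip(cameras, keypoints_2d):
        X_cam = X_3d @ R.T + t              # (M, 3) world -> camera coordinates
        proj = X_cam @ K.T
        proj = proj[:, :2] / proj[:, 2:3]   # x̂^i_{t,m}
        loss = loss + torch.norm(proj - x_2d, dim=-1).sum()
    return loss

# Per-frame optimization of the *network parameters* (not the 3D coordinates):
# optimizer = torch.optim.SGD(net.parameters(), lr=1e-4)
# for x_input in two_d_sequence:            # 2D keypoints of the input view, frame by frame
#     X_3d = net(x_input)                   # predicted 3D keypoints for this frame
#     loss = reprojection_loss(X_3d, keypoints_2d_all_views, cameras)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```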
The method may determine the two-dimensional positions of the object's keypoints frame by frame, estimate the three-dimensional positions frame by frame, and optimize the network frame by frame, or it may perform this processing once every several frames; therefore time t and time t+1 only describe the temporal order of two moments and are not limited to two adjacent frames.
The method for identifying the three-dimensional position of an object combines a data-driven neural network with traditional, manually modeled optimization: a neural network converts the sequence of two-dimensional keypoint coordinates into a sequence of three-dimensional coordinates, so the optimization problem over the three-dimensional keypoint coordinates becomes an optimization of the neural network's parameters, which constrains the temporal relationship of the coordinates better than optimizing the three-dimensional coordinates directly. In addition, the invention estimates the three-dimensional position by optimization rather than direct inference, makes full use of the video information captured from multiple view angles, and imposes explicit geometric constraints on the three-dimensional keypoint coordinates, so the identification process is more efficient, the identification result is more accurate, and the weak generalization capability common in the prior art is overcome.
Because an image sequence (video) is fed into the neural network and the keypoints have a temporal relationship within the sequence, the neural network preferably adopts a recurrent neural network with a multi-layer LSTM (Long Short-Term Memory) structure. The LSTM solves the long-term dependence problem of sequence input and can effectively exploit the temporal relationship of the input, so the three-dimensional keypoint positions are predicted more efficiently and more accurately.
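One possible realization of such a recurrent lifter is sketched below in PyTorch; the hidden size, number of layers and the flattened input/output layout are illustrative assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class Lifter3D(nn.Module):
    """Multi-layer LSTM mapping a sequence of 2D keypoints (T, M*2)
    to a sequence of 3D keypoints (T, M*3), as one possible realization
    of the recurrent structure described in the text."""
    def __init__(self, num_keypoints=17, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_keypoints * 2,
                            hidden_size=hidden,
                            num_layers=layers,
                            batch_first=True)
        self.head = nn.Linear(hidden, num_keypoints * 3)

    def forward(self, x2d_seq):
        # x2d_seq: (batch, T, num_keypoints*2) flattened 2D coordinates
        h, _ = self.lstm(x2d_seq)
        return self.head(h)               # (batch, T, num_keypoints*3)
```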
The above steps serve as both the identification process and the training process: the parameters of the neural network can be initialized randomly and identification is then carried out on the input video, thereby correcting the parameters of the neural network, and the identification result is output to the user only after the set convergence condition is reached.
To improve recognition efficiency, the parameters of the neural network may also be initialized in a specific way before recognition, i.e. the neural network is trained with specific training data before it is used for recognition. The training data are videos captured by the image pickup devices at the i view angles; they include the process of the object moving away from or toward the cameras, and keypoints are likewise predefined for the object in the training data.
During training, the operations of recognizing the two-dimensional positions, predicting the three-dimensional positions, and computing the projection positions are performed as in steps S2 to S5 above, but the loss function differs from that of the recognition process. In this embodiment the training process is divided into two stages that use different loss functions, i.e. different optimization objectives. Specifically, the loss function used in the first stage updates the parameters of the neural network with a first optimization objective of making the depth coordinates of the three-dimensional positions of the object keypoints in the training data, as output by the neural network, positive; for example, training uses the loss function

Q = Σ_{t,m} max(0, τ − z_{t,m}),

where z_{t,m} denotes the z coordinate of the three-dimensional coordinate of keypoint m at time t and τ is a constant greater than 0; the parameters of the neural network are updated continuously until Q = 0, which ends the first training stage.
The loss function used in the second stage updates the parameters of the neural network with a second optimization objective that, on the basis of the first optimization objective, makes the projection positions of the object keypoints in the training data consistent with their two-dimensional positions. The second stage trains with Q + L as the loss function, i.e. consistency is required while the network is still prevented from diverging back to the state Q > 0.
After the two training stages finish, the network parameters are stored and later used as the initial parameters during recognition, after which L continues to be optimized in the recognition process. Tests show that this initialization method ensures convergence of the optimization; in practical applications, when the network is initialized in this way the optimization converges within 10 minutes, which meets the requirements of practical application scenarios.
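The two-stage initialization can be sketched as follows, assuming the hinge-style depth loss Q reconstructed above; the value of τ and the loop structure are illustrative assumptions.

```python
import torch

def depth_loss_Q(X_3d_seq, tau=0.1):
    """Stage-1 loss: hinge on the depth (z) coordinate, Q = sum max(0, tau - z).
    Q reaches 0 once every predicted keypoint has z >= tau > 0."""
    z = X_3d_seq[..., 2]                     # depth coordinate of every keypoint
    return torch.clamp(tau - z, min=0).sum()

# Stage 1: optimize Q alone until Q == 0.
# Stage 2: optimize Q + reprojection_loss(...) (the L from the earlier sketch),
# i.e. require view consistency while keeping the network from drifting back to Q > 0.
```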
In a preferred embodiment, the step S2 includes:
and S21, determining the areas where the objects are located in the plurality of videos respectively by using the trained object detection network. For example, the object detection network MASK-RCNN may be used to perform object detection on each frame of image of each video to obtain the frame of the target object. In order to suppress the occurrence of the false detection phenomenon, it may be required that only the detection frame with the highest degree of confidence is retained if a plurality of target objects are detected in one image. The object position of each frame image is composed of a quadruple (x)1,y1,x2,y2) And representing the pixel coordinates of the upper left corner and the lower right corner of the detection box. If the target object cannot be correctly detected in the partial image, the partial image is represented by (0, 0, 1, 1) output.
S22, determining the two-dimensional positions of the keypoints within those regions respectively, using a trained keypoint detection network. For example, when the recognized object is a human body, a CPN (Cascaded Pyramid Network) can be used to locate the human-body keypoints in the image sequence: each frame and its corresponding human detection box are fed into the CPN to obtain the pixel coordinates of every keypoint. The original CPN is trained on the COCO data set, which only has two-dimensional keypoint labels and uses a different definition of human keypoints; to obtain a keypoint representation consistent with the three-dimensional data set Human3.6M, the CPN is instead trained with the two-dimensional keypoint labels of the Human3.6M data set.
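Step S21 could look roughly like the following, using the off-the-shelf Mask R-CNN from torchvision (assuming a recent torchvision with the weights API); the score threshold and the person-only filtering are illustrative assumptions. The CPN of step S22 is not sketched, since it is not available as a standard library component.

```python
import torch
import torchvision

# Pretrained Mask R-CNN from torchvision, used here only as an off-the-shelf detector.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_person_box(image_tensor, score_threshold=0.7):
    """Return (x1, y1, x2, y2) of the highest-confidence person detection,
    or (0, 0, 1, 1) when no person is found in the frame."""
    with torch.no_grad():
        out = model([image_tensor])[0]       # image_tensor: (3, H, W), values in [0, 1]
    keep = (out["labels"] == 1) & (out["scores"] > score_threshold)  # COCO label 1 = person
    if not keep.any():
        return (0.0, 0.0, 1.0, 1.0)
    best = out["scores"][keep].argmax()
    x1, y1, x2, y2 = out["boxes"][keep][best].tolist()
    return (x1, y1, x2, y2)
```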
In a specific embodiment, the method is used in a medical scene, the identified object is a human body, and a plurality of parts of the human body are defined as key points. As shown in fig. 2, the present embodiment provides a system for recognizing a three-dimensional position of an object, which includes three cameras and a terminal for data processing.
The three cameras are placed in front of a channel about 6 meters long, are kept at roughly the same height, and film the inside of the channel from the left front, directly in front, and the right front respectively. The three cameras shoot synchronously at the same frame rate (25 frames/second), and each frame of image is 1920 pixels high and 1080 pixels wide; the video acquisition scene is shown in fig. 2 and 3.
The three cameras are calibrated after the setup is built. First, the intrinsic parameters of the three cameras are calibrated with black-and-white checkerboard pictures, directly using the camera calibration toolbox of MATLAB. The extrinsic parameters are then calibrated with COLMAP: a group of pictures taken by the three cameras at the same moment is selected, and the camera parameters are obtained while performing sparse reconstruction with COLMAP's sparse reconstruction function. Once calibrated, the three cameras are not moved again and are not re-calibrated when new videos are shot; the parameters calibrated when the environment was built are used throughout.
A chair is placed at the far end of the roughly 6-meter channel; transverse lines are marked on the channel about every 60 centimeters, and a red line is marked about 2.5 meters from the cameras, before which the patient must complete the turning action. Partition boards outside the channel shield external interference, and sufficient illumination is provided above the channel. The three cameras are connected via USB to a terminal (PC) equipped with dedicated software for operating the cameras and for capturing, storing, processing and analyzing the video.
Videos are collected with the three cameras as follows: the patient first sits on the chair at the far end of the channel; after shooting starts, the patient is asked to stand up gradually from the sitting posture and walk toward the near end of the channel, and after walking about 3.5 meters the patient turns around, returns to the chair at the far end and sits down, which completes one shooting session. Throughout the shooting only one patient appears in the pictures of the three cameras, and no unrelated person enters the channel. Depending on the patient's walking speed, one shooting session lasts about 10 to 20 seconds, and for patients with severe walking impairment it can exceed one minute.
After the three videos are collected, the terminal performs the operation of identifying the three-dimensional positions of the human-body keypoints. In this embodiment the human-body keypoints comprise 17 locations, namely the top of the head, the nose, the neck, the left shoulder, the left elbow, the left wrist, the right shoulder, the right elbow, the right wrist, the back, the center of the two hips, the left hip, the left knee, the left ankle, the right hip, the right knee and the right ankle.
The terminal identifies the positions of the keypoints using the method shown in fig. 1; in this embodiment the time-series two-dimensional keypoints of the front view are used as input to obtain the three-dimensional keypoint coordinates at every moment. The identification and optimization process of the neural network is shown in fig. 4. For example, the two-dimensional position of keypoint m at time t is x^1_{t,m} in the left-front view (video1), x^2_{t,m} in the front view (video2), and x^3_{t,m} in the right-front view (video3). The input data of the neural network is x^2_{t,m}; based on x^2_{t,m} the neural network outputs the three-dimensional position X_{t,m} of keypoint m at time t. Then, according to X_{t,m} and the camera parameters, the projection positions x̂^1_{t,m}, x̂^2_{t,m} and x̂^3_{t,m} of the keypoint in the three view angles are computed, from which the differences between x̂^1_{t,m} and x^1_{t,m}, between x̂^2_{t,m} and x^2_{t,m}, and between x̂^3_{t,m} and x^3_{t,m} are calculated, so that the parameters of the neural network are optimized according to these differences.
The parameter initialization of the neural network may refer to the training scheme in the above embodiments, and is not described herein.
After the three-dimensional positions of all key points of the human body are obtained, the posture of the human body can be analyzed. The coordinates of the three-dimensional key points can represent the posture of the human body, the change condition of the data along with time can represent the motion state of the human body, and the data can be used for diagnosing or analyzing related diseases.
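The patent does not prescribe any particular downstream analysis; purely as an illustration of how the time-varying three-dimensional keypoints could be summarized, the sketch below computes a left-knee flexion angle per frame, with hypothetical keypoint indices into the 17-point layout above.

```python
import numpy as np

# Hypothetical indices into the 17-keypoint layout described above (assumption).
L_HIP, L_KNEE, L_ANKLE = 11, 12, 13

def knee_angle(frame_xyz):
    """Angle at the left knee (degrees) for one frame of shape (17, 3)."""
    thigh = frame_xyz[L_HIP] - frame_xyz[L_KNEE]
    shank = frame_xyz[L_ANKLE] - frame_xyz[L_KNEE]
    cos_a = np.dot(thigh, shank) / (np.linalg.norm(thigh) * np.linalg.norm(shank))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# angles = [knee_angle(f) for f in sequence_xyz]   # sequence_xyz: (T, 17, 3)
```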
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications therefrom are within the scope of the invention.

Claims (9)

1. A method of identifying a three-dimensional position of an object, comprising:
acquiring a plurality of videos respectively shot by a plurality of camera devices on the same object;
determining two-dimensional positions of key points of the object in the plurality of videos respectively;
predicting the three-dimensional position of the key point according to the two-dimensional position by utilizing a neural network;
determining the projection positions of the key points in the imaging surfaces of the plurality of camera devices according to the three-dimensional positions and the parameters of the plurality of camera devices;
calculating a loss function of the neural network according to the difference between the projection position and the two-dimensional position, and optimizing parameters of the neural network according to the loss function.
2. The method according to claim 1, wherein in the step of predicting the three-dimensional position of the keypoint from the two-dimensional position using a neural network, the two-dimensional position of the keypoint in one of the videos is used as input data to the neural network, and the neural network is caused to output the three-dimensional position.
3. The method of claim 2, wherein the plurality of videos are videos captured by an odd number of closely spaced horizontally spaced cameras, and wherein the input data is taken from a video captured by a horizontally centered camera.
4. The method of claim 1, wherein determining the two-dimensional locations of the keypoints of the object in the plurality of videos respectively comprises:
respectively determining the areas of the objects in the plurality of videos by using the trained object detection network;
and respectively determining the two-dimensional positions of the key points in the region by utilizing the trained key point detection network.
5. The method according to claim 1, before acquiring a plurality of videos respectively photographed by a plurality of photographing devices on the same object, further comprising: parameters of the neural network are initialized by using training data, wherein the training data are a plurality of videos shot by a plurality of camera devices on the same object, and the videos comprise the process that the object is far away from or close to the camera devices.
6. The method according to claim 5, characterized in that the initialization is divided into two phases, in which the loss functions used are not identical.
7. The method of claim 6, wherein the loss function used in the first stage updates the parameters of the neural network with a first optimization goal of making the depth coordinates of the three-dimensional positions of the object keypoints in the training data output by the neural network positive;
the loss function used in the second stage updates the parameters of the neural network with a second optimization objective that reconciles the projected locations and the two-dimensional locations of the object keypoints in the training data on the basis of the first optimization objective.
8. The method of any one of claims 1-7, wherein the neural network is a long-short term memory network.
9. A system for identifying a three-dimensional position of an object, comprising:
a plurality of image pickup devices for respectively picking up videos of the same object;
a terminal for identifying three-dimensional locations of key points of an object according to the method of any one of claims 1-8.
CN201911223409.9A 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object Active CN110910449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911223409.9A CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911223409.9A CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Publications (2)

Publication Number Publication Date
CN110910449A true CN110910449A (en) 2020-03-24
CN110910449B CN110910449B (en) 2023-10-13

Family

ID=69821806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911223409.9A Active CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Country Status (1)

Country Link
CN (1) CN110910449B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113916906A (en) * 2021-09-03 2022-01-11 江苏理工学院 LED light source illumination optimization method of visual detection system and used experimental equipment
CN114972958A (en) * 2022-07-27 2022-08-30 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN115620094A (en) * 2022-12-19 2023-01-17 南昌虚拟现实研究院股份有限公司 Key point marking method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945269A (en) * 2017-12-26 2018-04-20 清华大学 Complicated dynamic human body object three-dimensional rebuilding method and system based on multi-view point video
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 A kind of Human bodys' response method based on deep video
WO2018121737A1 (en) * 2016-12-30 2018-07-05 北京市商汤科技开发有限公司 Keypoint prediction, network training, and image processing methods, device, and electronic device
CN109064549A (en) * 2018-07-16 2018-12-21 中南大学 Index point detection model generation method and mark point detecting method
CN109087329A (en) * 2018-07-27 2018-12-25 中山大学 Human body three-dimensional joint point estimation frame and its localization method based on depth network
CN110348371A (en) * 2019-07-08 2019-10-18 叠境数字科技(上海)有限公司 Human body three-dimensional acts extraction method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018121737A1 (en) * 2016-12-30 2018-07-05 北京市商汤科技开发有限公司 Keypoint prediction, network training, and image processing methods, device, and electronic device
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 A kind of Human bodys' response method based on deep video
CN107945269A (en) * 2017-12-26 2018-04-20 清华大学 Complicated dynamic human body object three-dimensional rebuilding method and system based on multi-view point video
CN109064549A (en) * 2018-07-16 2018-12-21 中南大学 Index point detection model generation method and mark point detecting method
CN109087329A (en) * 2018-07-27 2018-12-25 中山大学 Human body three-dimensional joint point estimation frame and its localization method based on depth network
CN110348371A (en) * 2019-07-08 2019-10-18 叠境数字科技(上海)有限公司 Human body three-dimensional acts extraction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DARIO PAVLLO等: "3D human pose estimation in video with temporal convolutions and semi-supervised training", 《ARXIV》 *
DARIO PAVLLO等: "3D human pose estimation in video with temporal convolutions and semi-supervised training", 《ARXIV》, 29 March 2019 (2019-03-29) *
吴誉兰 et al.: "Recognition and simulation of dynamic arm three-dimensional postures based on Kinect", 计算机仿真 (Computer Simulation), no. 07

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113916906A (en) * 2021-09-03 2022-01-11 江苏理工学院 LED light source illumination optimization method of visual detection system and used experimental equipment
CN113916906B (en) * 2021-09-03 2024-01-09 江苏理工学院 LED light source illumination optimization method of visual detection system and experimental equipment used by method
CN114972958A (en) * 2022-07-27 2022-08-30 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN115620094A (en) * 2022-12-19 2023-01-17 南昌虚拟现实研究院股份有限公司 Key point marking method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110910449B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110321754B (en) Human motion posture correction method and system based on computer vision
US9330470B2 (en) Method and system for modeling subjects from a depth map
CN105531995B (en) System and method for using multiple video cameras to carry out object and event recognition
CN110544301A (en) Three-dimensional human body action reconstruction system, method and action training system
CN110910449B (en) Method and system for identifying three-dimensional position of object
CN111881887A (en) Multi-camera-based motion attitude monitoring and guiding method and device
US20200090408A1 (en) Systems and methods for augmented reality body movement guidance and measurement
CN110264493A (en) A kind of multiple target object tracking method and device under motion state
CN113239797B (en) Human body action recognition method, device and system
CN112668549B (en) Pedestrian attitude analysis method, system, terminal and storage medium
JP2017097577A (en) Posture estimation method and posture estimation device
CN113658211B (en) User gesture evaluation method and device and processing equipment
CN110544302A (en) Human body action reconstruction system and method based on multi-view vision and action training system
KR20080018642A (en) Remote emergency monitoring system and method
ElSayed et al. Ambient and wearable sensing for gait classification in pervasive healthcare environments
WO2022127181A1 (en) Passenger flow monitoring method and apparatus, and electronic device and storage medium
KR101995411B1 (en) Device and method for making body model
CN114120168A (en) Target running distance measuring and calculating method, system, equipment and storage medium
KR20180094554A (en) Apparatus and method for reconstructing 3d image
CN115035546A (en) Three-dimensional human body posture detection method and device and electronic equipment
CN116805433B (en) Human motion trail data analysis system
CN117238031A (en) Motion capturing method and system for virtual person
WO2019137186A1 (en) Food identification method and apparatus, storage medium and computer device
US20200226787A1 (en) Information processing apparatus, information processing method, and program
US20220395193A1 (en) Height estimation apparatus, height estimation method, and non-transitory computer readable medium storing program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant