CN110910449A - Method and system for recognizing three-dimensional position of object - Google Patents

Method and system for recognizing three-dimensional position of object Download PDF

Info

Publication number
CN110910449A
CN110910449A
Authority
CN
China
Prior art keywords
neural network
dimensional
videos
dimensional position
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911223409.9A
Other languages
Chinese (zh)
Other versions
CN110910449B (en)
Inventor
陈健生
薛有泽
万纬韬
张馨予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911223409.9A priority Critical patent/CN110910449B/en
Publication of CN110910449A publication Critical patent/CN110910449A/en
Application granted granted Critical
Publication of CN110910449B publication Critical patent/CN110910449B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The invention provides a method and a system for identifying the three-dimensional position of an object, wherein the method comprises the following steps: acquiring a plurality of videos respectively shot by a plurality of camera devices on the same object; determining two-dimensional positions of key points of the object in the plurality of videos respectively; predicting the three-dimensional position of the key point according to the two-dimensional position by utilizing a neural network; determining the projection positions of the key points in the imaging surfaces of the plurality of camera devices according to the three-dimensional positions and the parameters of the plurality of camera devices; calculating a loss function of the neural network according to the difference between the projection position and the two-dimensional position, and optimizing parameters of the neural network according to the loss function.

Description

Method and system for recognizing three-dimensional position of object
Technical Field
The invention relates to the field of image recognition, in particular to a method and a system for recognizing a three-dimensional position of an object.
Background
Neural networks have been used to estimate three-dimensional positions from two-dimensional images of objects; existing algorithms directly infer three-dimensional coordinates using the two-dimensional keypoint coordinates of a single view as input. Tests of these existing neural network estimation algorithms on several videos show that their generalization capability is poor.
There are two main reasons for the poor generalization capability of the prior art. First, a single view angle cannot provide enough three-dimensional information: the three-dimensional structure inferred by the neural network depends on the statistical characteristics of the training data and cannot be transferred correctly to new scenes and different camera configurations. Second, the actual use environment differs greatly from commonly used public data sets such as Human3.6M, so a model trained on such data sets cannot generalize to the actual application scenario.
Disclosure of Invention
In view of the above, the present invention provides a method for recognizing a three-dimensional position of an object, including:
acquiring a plurality of videos respectively shot by a plurality of camera devices on the same object;
determining two-dimensional positions of key points of the object in the plurality of videos respectively;
predicting the three-dimensional position of the key point according to the two-dimensional position by utilizing a neural network;
determining the projection positions of the key points in the imaging surfaces of the plurality of camera devices according to the three-dimensional positions and the parameters of the plurality of camera devices;
calculating a loss function of the neural network according to the difference between the projection position and the two-dimensional position, and optimizing parameters of the neural network according to the loss function.
Optionally, in the step of predicting the three-dimensional position of the keypoint from the two-dimensional position by using a neural network, the two-dimensional position of the keypoint in one of the videos is used as input data of the neural network, so that the neural network outputs the three-dimensional position.
Optionally, the plurality of videos are videos captured by an odd number of cameras that are closely spaced horizontally, and the input data is taken from a video captured by a camera that is centered horizontally.
Optionally, the determining the two-dimensional positions of the key points of the object in the plurality of videos respectively comprises:
respectively determining the areas of the objects in the plurality of videos by using the trained object detection network;
and respectively determining the two-dimensional positions of the key points in the region by utilizing the trained key point detection network.
Optionally, before acquiring a plurality of videos respectively captured by a plurality of image capturing apparatuses for the same object, the method further includes: initializing the parameters of the neural network using training data, wherein the training data are a plurality of videos of the same object shot by a plurality of camera devices, and the videos include the process of the object moving away from or toward the camera devices.
Optionally, the initialization is divided into two phases, the loss functions used in the two phases being different.
Optionally, the loss function used in the first stage updates the parameters of the neural network with a first optimization goal, the first optimization goal being to make the depth coordinates of the three-dimensional positions of the object key points in the training data output by the neural network positive;
the loss function used in the second stage updates the parameters of the neural network with a second optimization objective that reconciles the projected locations and the two-dimensional locations of the object keypoints in the training data on the basis of the first optimization objective.
Optionally, the neural network is a long-short term memory network.
Optionally, the object is a human body, and the key points include a plurality of parts of the human body.
The present invention also provides a system for identifying a three-dimensional position of an object, comprising:
a plurality of image pickup devices for respectively picking up videos of the same object;
and the terminal is used for identifying the three-dimensional position of the object according to the method for identifying the three-dimensional position of the object.
The method for identifying the three-dimensional position of an object combines a data-driven neural network with traditional, manually modeled optimization: a neural network converts the sequence of two-dimensional keypoint coordinates into a sequence of three-dimensional coordinates, so the optimization problem over the three-dimensional keypoint coordinates becomes an optimization of the neural network's parameters, which constrains the temporal relationship of the coordinates better than optimizing the three-dimensional coordinates directly. In addition, the invention estimates the three-dimensional position by optimization rather than direct inference, makes full use of the video information captured from multiple view angles, and imposes explicit geometric constraints on the three-dimensional keypoint coordinates, so the identification process is more efficient, the identification result is more accurate, and the weak generalization capability common in the prior art is overcome.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method of identifying a three-dimensional position of an object in an embodiment of the invention;
FIGS. 2 and 3 are schematic views of a scene of a system for identifying a three-dimensional position of an object according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a process of recognizing a three-dimensional position of a human body in an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a method for identifying the three-dimensional position of an object, which can be executed by electronic equipment such as a computer, a server and the like, and as shown in figure 1, the method comprises the following steps:
s1, a plurality of images obtained by the plurality of imaging devices respectively capturing the same object are acquired. The number of the plurality of cameras may be 2, 3 or more, and in an actual usage scenario, the heights of the cameras should be substantially the same, have a certain interval in the horizontal direction, and all face towards the object to be photographed.
These cameras capture a single object simultaneously to obtain i views of the video.
S2, determining the two-dimensional positions of the keypoints of the object in the plurality of videos respectively. A video is a sequence of images (frames); taking the frames at time t as an example, time t corresponds to i images, and the obtained two-dimensional position can be written as x^i_{t,m}, meaning the two-dimensional position of keypoint m at time t in the image of view angle i. Many methods exist for determining the two-dimensional coordinates of a point in a two-dimensional image, and any existing technique may be used for this operation.
S3, predicting the three-dimensional position of the keypoint from the two-dimensional position using the neural network. With its current (or initialized) parameters, the neural network takes as input the two-dimensional positions x^i_{t,m} from one, several, or all of the i images at time t and outputs a three-dimensional coordinate, namely the three-dimensional position of keypoint m at time t, written X_{t,m}.
S4, determining the projection positions of the keypoints in the imaging plane of each camera according to the three-dimensional positions and the parameters of the plurality of cameras. The camera parameters can be calibrated when the hardware environment is set up; they comprise intrinsic and extrinsic parameters. First, the intrinsic parameters of the three cameras can be calibrated with black-and-white checkerboard pictures, using only the camera calibration toolbox of MATLAB. The extrinsic parameters are then calibrated with COLMAP: a group of pictures shot by the cameras at the same moment is selected, and the extrinsic parameters are obtained while performing sparse reconstruction with COLMAP's sparse reconstruction function. The calibrated cameras are not moved afterwards, so they are not re-calibrated when videos are shot; the parameters calibrated when the environment was built are used throughout.
The three-dimensional coordinates are projected to each view angle using the pre-calibrated parameters, giving the projection coordinates x̂^i_{t,m}, i.e. the projection position in view angle i of the three-dimensional coordinate of keypoint m at time t.
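Under a standard pinhole model, this projection step can be written down concretely. The following is a minimal sketch, assuming intrinsics K and extrinsics (R, t) per camera as produced by the calibration described above; the function name and array shapes are illustrative, not taken from the patent.

```python
import numpy as np

def project_to_view(X_world, K, R, t):
    """Project 3D keypoints (M, 3) in world coordinates onto one camera's image plane.

    K: 3x3 intrinsic matrix, R: 3x3 rotation, t: (3,) translation (extrinsics),
    e.g. as obtained from the MATLAB toolbox / COLMAP calibration.
    Returns pixel coordinates of shape (M, 2).
    """
    X_cam = X_world @ R.T + t          # world -> camera coordinates
    x = X_cam @ K.T                    # camera -> homogeneous pixel coordinates
    return x[:, :2] / x[:, 2:3]        # perspective division

# Example usage: projections[i] corresponds to x̂^i_{t,m} in the text.
# cameras = [(K1, R1, t1), (K2, R2, t2), (K3, R3, t3)]
# projections = [project_to_view(X_t, K, R, t) for K, R, t in cameras]
```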
S5, calculating a loss function of the neural network from the difference between the projection positions and the two-dimensional positions, and optimizing the parameters of the neural network according to the loss function. The three-dimensional coordinate of keypoint m should be consistent across the view angles, i.e. after projection it should coincide with the two-dimensional keypoint coordinate of each view angle: x̂^i_{t,m} should agree (be equal or approximately equal) with x^i_{t,m}. The optimization target can therefore be defined from the difference between the two, giving the loss function

L = Σ_i Σ_m ‖ x̂^i_{t,m} − x^i_{t,m} ‖,

which should be as small as possible.
For example, for the images at time t, the three-dimensional positions of the keypoints at time t are estimated, the loss function is computed, and the parameters of the neural network are optimized; the optimized parameters are then used when the neural network estimates the three-dimensional positions of the keypoints at time t+1. The optimization target L defined above can be optimized by gradient descent, and the parameters of the neural network are updated continuously during the optimization until the three-dimensional keypoints estimated by the neural network are consistent with the two-dimensional keypoints of every view angle. In this way the structure of the neural network is combined with the optimization idea, and three-dimensional keypoint coordinates with view-angle consistency are obtained indirectly by optimizing the parameters of the neural network.
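A minimal PyTorch sketch of this per-frame scheme is given below. It assumes a differentiable reprojection and plain gradient descent; the names net, cameras and two_d_sequence, and the use of SGD, are illustrative assumptions and not taken from the patent.

```python
import torch

def reprojection_loss(X_3d, keypoints_2d, cameras):
    """L: sum over views i and keypoints m of the distance between the projection
    of the predicted 3D keypoints and the detected 2D keypoints of that view."""
    loss = 0.0
    for (K, R, t), x_2d in zip(cameras, keypoints_2d):
        X_cam = X_3d @ R.T + t              # (M, 3) world -> camera coordinates
        proj = X_cam @ K.T
        proj = proj[:, :2] / proj[:, 2:3]   # x̂^i_{t,m}
        loss = loss + torch.norm(proj - x_2d, dim=-1).sum()
    return loss

# Per-frame optimization of the *network parameters* (not the 3D coordinates):
# optimizer = torch.optim.SGD(net.parameters(), lr=1e-4)
# for x_input in two_d_sequence:            # 2D keypoints of the input view, frame by frame
#     X_3d = net(x_input)                   # predicted 3D keypoints for this frame
#     loss = reprojection_loss(X_3d, keypoints_2d_all_views, cameras)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```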
The method may determine the two-dimensional positions of the object's keypoints frame by frame, estimate the three-dimensional positions frame by frame, and optimize the network frame by frame, or it may perform this processing once every several frames; therefore time t and time t+1 only describe the temporal order of two moments and are not limited to two adjacent frames.
The method for identifying the three-dimensional position of an object combines a data-driven neural network with traditional, manually modeled optimization: a neural network converts the sequence of two-dimensional keypoint coordinates into a sequence of three-dimensional coordinates, so the optimization problem over the three-dimensional keypoint coordinates becomes an optimization of the neural network's parameters, which constrains the temporal relationship of the coordinates better than optimizing the three-dimensional coordinates directly. In addition, the invention estimates the three-dimensional position by optimization rather than direct inference, makes full use of the video information captured from multiple view angles, and imposes explicit geometric constraints on the three-dimensional keypoint coordinates, so the identification process is more efficient, the identification result is more accurate, and the weak generalization capability common in the prior art is overcome.
Because an image sequence (video) is fed into the neural network and the keypoints have a temporal relationship within the sequence, the neural network preferably adopts a recurrent neural network with a multi-layer LSTM (Long Short-Term Memory) structure. The LSTM solves the long-term dependence problem of sequence input and can effectively exploit the temporal relationship of the input, so the three-dimensional keypoint positions are predicted more efficiently and more accurately.
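One possible realization of such a recurrent lifter is sketched below in PyTorch; the hidden size, number of layers and the flattened input/output layout are illustrative assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class Lifter3D(nn.Module):
    """Multi-layer LSTM mapping a sequence of 2D keypoints (T, M*2)
    to a sequence of 3D keypoints (T, M*3), as one possible realization
    of the recurrent structure described in the text."""
    def __init__(self, num_keypoints=17, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_keypoints * 2,
                            hidden_size=hidden,
                            num_layers=layers,
                            batch_first=True)
        self.head = nn.Linear(hidden, num_keypoints * 3)

    def forward(self, x2d_seq):
        # x2d_seq: (batch, T, num_keypoints*2) flattened 2D coordinates
        h, _ = self.lstm(x2d_seq)
        return self.head(h)               # (batch, T, num_keypoints*3)
```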
The above steps serve as both the identification process and the training process: the parameters of the neural network can be initialized randomly and identification is then carried out on the input video, thereby correcting the parameters of the neural network, and the identification result is output to the user only after the set convergence condition is reached.
To improve recognition efficiency, the parameters of the neural network may also be initialized in a specific way before recognition, i.e. the neural network is trained with specific training data before it is used for recognition. The training data are videos captured by the image pickup devices at the i view angles; they include the process of the object moving away from or toward the cameras, and keypoints are likewise predefined for the object in the training data.
During training, the operations of recognizing the two-dimensional positions, predicting the three-dimensional positions, and computing the projection positions are performed as in steps S2 to S5 above, but the loss function differs from that of the recognition process. In this embodiment the training process is divided into two stages that use different loss functions, i.e. different optimization objectives. Specifically, the loss function used in the first stage updates the parameters of the neural network with a first optimization objective of making the depth coordinates of the three-dimensional positions of the object keypoints in the training data, as output by the neural network, positive; for example, training uses the loss function

Q = Σ_{t,m} max(0, τ − z_{t,m}),

where z_{t,m} denotes the z coordinate of the three-dimensional coordinate of keypoint m at time t and τ is a constant greater than 0; the parameters of the neural network are updated continuously until Q = 0, which ends the first training stage.
The loss function used in the second stage updates the parameters of the neural network with a second optimization objective that, on the basis of the first optimization objective, makes the projection positions of the object keypoints in the training data consistent with their two-dimensional positions. The second stage trains with Q + L as the loss function, i.e. consistency is required while the network is still prevented from diverging back to the state Q > 0.
After the two training stages finish, the network parameters are stored and later used as the initial parameters during recognition, after which L continues to be optimized in the recognition process. Tests show that this initialization method ensures convergence of the optimization; in practical applications, when the network is initialized in this way the optimization converges within 10 minutes, which meets the requirements of practical application scenarios.
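The two-stage initialization can be sketched as follows, assuming the hinge-style depth loss Q reconstructed above; the value of τ and the loop structure are illustrative assumptions.

```python
import torch

def depth_loss_Q(X_3d_seq, tau=0.1):
    """Stage-1 loss: hinge on the depth (z) coordinate, Q = sum max(0, tau - z).
    Q reaches 0 once every predicted keypoint has z >= tau > 0."""
    z = X_3d_seq[..., 2]                     # depth coordinate of every keypoint
    return torch.clamp(tau - z, min=0).sum()

# Stage 1: optimize Q alone until Q == 0.
# Stage 2: optimize Q + reprojection_loss(...) (the L from the earlier sketch),
# i.e. require view consistency while keeping the network from drifting back to Q > 0.
```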
In a preferred embodiment, the step S2 includes:
and S21, determining the areas where the objects are located in the plurality of videos respectively by using the trained object detection network. For example, the object detection network MASK-RCNN may be used to perform object detection on each frame of image of each video to obtain the frame of the target object. In order to suppress the occurrence of the false detection phenomenon, it may be required that only the detection frame with the highest degree of confidence is retained if a plurality of target objects are detected in one image. The object position of each frame image is composed of a quadruple (x)1,y1,x2,y2) And representing the pixel coordinates of the upper left corner and the lower right corner of the detection box. If the target object cannot be correctly detected in the partial image, the partial image is represented by (0, 0, 1, 1) output.
S22, determining the two-dimensional positions of the keypoints within those regions respectively, using a trained keypoint detection network. For example, when the recognized object is a human body, a CPN (Cascaded Pyramid Network) can be used to locate the human-body keypoints in the image sequence: each frame and its corresponding human detection box are fed into the CPN to obtain the pixel coordinates of every keypoint. The original CPN is trained on the COCO data set, which only has two-dimensional keypoint labels and uses a different definition of human keypoints; to obtain a keypoint representation consistent with the three-dimensional data set Human3.6M, the CPN is instead trained with the two-dimensional keypoint labels of the Human3.6M data set.
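Step S21 could look roughly like the following, using the off-the-shelf Mask R-CNN from torchvision (assuming a recent torchvision with the weights API); the score threshold and the person-only filtering are illustrative assumptions. The CPN of step S22 is not sketched, since it is not available as a standard library component.

```python
import torch
import torchvision

# Pretrained Mask R-CNN from torchvision, used here only as an off-the-shelf detector.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_person_box(image_tensor, score_threshold=0.7):
    """Return (x1, y1, x2, y2) of the highest-confidence person detection,
    or (0, 0, 1, 1) when no person is found in the frame."""
    with torch.no_grad():
        out = model([image_tensor])[0]       # image_tensor: (3, H, W), values in [0, 1]
    keep = (out["labels"] == 1) & (out["scores"] > score_threshold)  # COCO label 1 = person
    if not keep.any():
        return (0.0, 0.0, 1.0, 1.0)
    best = out["scores"][keep].argmax()
    x1, y1, x2, y2 = out["boxes"][keep][best].tolist()
    return (x1, y1, x2, y2)
```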
In a specific embodiment, the method is used in a medical scene, the identified object is a human body, and a plurality of parts of the human body are defined as key points. As shown in fig. 2, the present embodiment provides a system for recognizing a three-dimensional position of an object, which includes three cameras and a terminal for data processing.
The three cameras are placed in front of a channel about 6 meters long, are kept at roughly the same height, and film the inside of the channel from the left front, directly in front, and the right front respectively. The three cameras shoot synchronously at the same frame rate (25 frames/second), and each frame of image is 1920 pixels high and 1080 pixels wide; the video acquisition scene is shown in fig. 2 and 3.
The three cameras are calibrated after the setup is built. First, the intrinsic parameters of the three cameras are calibrated with black-and-white checkerboard pictures, directly using the camera calibration toolbox of MATLAB. The extrinsic parameters are then calibrated with COLMAP: a group of pictures taken by the three cameras at the same moment is selected, and the camera parameters are obtained while performing sparse reconstruction with COLMAP's sparse reconstruction function. Once calibrated, the three cameras are not moved again and are not re-calibrated when new videos are shot; the parameters calibrated when the environment was built are used throughout.
A chair is placed at the far end of the roughly 6-meter channel; transverse lines are marked on the channel about every 60 centimeters, and a red line is marked about 2.5 meters from the cameras, before which the patient must complete the turning action. Partition boards outside the channel shield external interference, and sufficient illumination is provided above the channel. The three cameras are connected via USB to a terminal (PC) equipped with dedicated software for operating the cameras and for capturing, storing, processing and analyzing the video.
Videos are collected with the three cameras as follows: the patient first sits on the chair at the far end of the channel; after shooting starts, the patient is asked to stand up gradually from the sitting posture and walk toward the near end of the channel, and after walking about 3.5 meters the patient turns around, returns to the chair at the far end and sits down, which completes one shooting session. Throughout the shooting only one patient appears in the pictures of the three cameras, and no unrelated person enters the channel. Depending on the patient's walking speed, one shooting session lasts about 10 to 20 seconds, and for patients with severe walking impairment it can exceed one minute.
After the three videos are collected, the terminal performs the operation of identifying the three-dimensional positions of the human-body keypoints. In this embodiment the human-body keypoints comprise 17 locations, namely the top of the head, the nose, the neck, the left shoulder, the left elbow, the left wrist, the right shoulder, the right elbow, the right wrist, the back, the center of the two hips, the left hip, the left knee, the left ankle, the right hip, the right knee and the right ankle.
The terminal identifies the positions of the keypoints using the method shown in fig. 1; in this embodiment the time-series two-dimensional keypoints of the front view are used as input to obtain the three-dimensional keypoint coordinates at every moment. The identification and optimization process of the neural network is shown in fig. 4. For example, the two-dimensional position of keypoint m at time t is x^1_{t,m} in the left-front view (video1), x^2_{t,m} in the front view (video2), and x^3_{t,m} in the right-front view (video3). The input data of the neural network is x^2_{t,m}; based on x^2_{t,m} the neural network outputs the three-dimensional position X_{t,m} of keypoint m at time t. Then, according to X_{t,m} and the camera parameters, the projection positions x̂^1_{t,m}, x̂^2_{t,m} and x̂^3_{t,m} of the keypoint in the three view angles are computed, from which the differences between x̂^1_{t,m} and x^1_{t,m}, between x̂^2_{t,m} and x^2_{t,m}, and between x̂^3_{t,m} and x^3_{t,m} are calculated, so that the parameters of the neural network are optimized according to these differences.
The parameter initialization of the neural network may refer to the training scheme in the above embodiments, and is not described herein.
After the three-dimensional positions of all key points of the human body are obtained, the posture of the human body can be analyzed. The coordinates of the three-dimensional key points can represent the posture of the human body, the change condition of the data along with time can represent the motion state of the human body, and the data can be used for diagnosing or analyzing related diseases.
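The patent does not prescribe any particular downstream analysis; purely as an illustration of how the time-varying three-dimensional keypoints could be summarized, the sketch below computes a left-knee flexion angle per frame, with hypothetical keypoint indices into the 17-point layout above.

```python
import numpy as np

# Hypothetical indices into the 17-keypoint layout described above (assumption).
L_HIP, L_KNEE, L_ANKLE = 11, 12, 13

def knee_angle(frame_xyz):
    """Angle at the left knee (degrees) for one frame of shape (17, 3)."""
    thigh = frame_xyz[L_HIP] - frame_xyz[L_KNEE]
    shank = frame_xyz[L_ANKLE] - frame_xyz[L_KNEE]
    cos_a = np.dot(thigh, shank) / (np.linalg.norm(thigh) * np.linalg.norm(shank))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# angles = [knee_angle(f) for f in sequence_xyz]   # sequence_xyz: (T, 17, 3)
```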
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications therefrom are within the scope of the invention.

Claims (9)

1. A method of identifying a three-dimensional position of an object, comprising:
acquiring a plurality of videos respectively shot by a plurality of camera devices on the same object;
determining two-dimensional positions of key points of the object in the plurality of videos respectively;
predicting the three-dimensional position of the key point according to the two-dimensional position by utilizing a neural network;
determining the projection positions of the key points in the imaging surfaces of the plurality of camera devices according to the three-dimensional positions and the parameters of the plurality of camera devices;
calculating a loss function of the neural network according to the difference between the projection position and the two-dimensional position, and optimizing parameters of the neural network according to the loss function.
2. The method according to claim 1, wherein in the step of predicting the three-dimensional position of the keypoint from the two-dimensional position using a neural network, the two-dimensional position of the keypoint in one of the videos is used as input data to the neural network, and the neural network is caused to output the three-dimensional position.
3. The method of claim 2, wherein the plurality of videos are videos captured by an odd number of closely spaced horizontally spaced cameras, and wherein the input data is taken from a video captured by a horizontally centered camera.
4. The method of claim 1, wherein determining the two-dimensional locations of the keypoints of the object in the plurality of videos respectively comprises:
respectively determining the areas of the objects in the plurality of videos by using the trained object detection network;
and respectively determining the two-dimensional positions of the key points in the region by utilizing the trained key point detection network.
5. The method according to claim 1, before acquiring a plurality of videos respectively photographed by a plurality of photographing devices on the same object, further comprising: parameters of the neural network are initialized by using training data, wherein the training data are a plurality of videos shot by a plurality of camera devices on the same object, and the videos comprise the process that the object is far away from or close to the camera devices.
6. The method according to claim 5, characterized in that the initialization is divided into two phases, in which the loss functions used are not identical.
7. The method of claim 6, wherein the loss function used in the first stage updates the parameters of the neural network with a first optimization goal of making the depth coordinates of the three-dimensional positions of the object keypoints in the training data output by the neural network positive;
the loss function used in the second stage updates the parameters of the neural network with a second optimization objective that reconciles the projected locations and the two-dimensional locations of the object keypoints in the training data on the basis of the first optimization objective.
8. The method of any one of claims 1-7, wherein the neural network is a long-short term memory network.
9. A system for identifying a three-dimensional position of an object, comprising:
a plurality of image pickup devices for respectively picking up videos of the same object;
a terminal for identifying three-dimensional locations of key points of an object according to the method of any one of claims 1-8.
CN201911223409.9A 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object Active CN110910449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911223409.9A CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911223409.9A CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Publications (2)

Publication Number Publication Date
CN110910449A true CN110910449A (en) 2020-03-24
CN110910449B CN110910449B (en) 2023-10-13

Family

ID=69821806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911223409.9A Active CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Country Status (1)

Country Link
CN (1) CN110910449B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113916906A (en) * 2021-09-03 2022-01-11 江苏理工学院 LED light source illumination optimization method of visual detection system and used experimental equipment
CN114972958A (en) * 2022-07-27 2022-08-30 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN115620094A (en) * 2022-12-19 2023-01-17 南昌虚拟现实研究院股份有限公司 Key point marking method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945269A (en) * 2017-12-26 2018-04-20 清华大学 Complicated dynamic human body object three-dimensional rebuilding method and system based on multi-view point video
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 A kind of Human bodys' response method based on deep video
WO2018121737A1 (en) * 2016-12-30 2018-07-05 北京市商汤科技开发有限公司 Keypoint prediction, network training, and image processing methods, device, and electronic device
CN109064549A (en) * 2018-07-16 2018-12-21 中南大学 Index point detection model generation method and mark point detecting method
CN109087329A (en) * 2018-07-27 2018-12-25 中山大学 Human body three-dimensional joint point estimation frame and its localization method based on depth network
CN110348371A (en) * 2019-07-08 2019-10-18 叠境数字科技(上海)有限公司 Human body three-dimensional acts extraction method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018121737A1 (en) * 2016-12-30 2018-07-05 北京市商汤科技开发有限公司 Keypoint prediction, network training, and image processing methods, device, and electronic device
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 A kind of Human bodys' response method based on deep video
CN107945269A (en) * 2017-12-26 2018-04-20 清华大学 Complicated dynamic human body object three-dimensional rebuilding method and system based on multi-view point video
CN109064549A (en) * 2018-07-16 2018-12-21 中南大学 Index point detection model generation method and mark point detecting method
CN109087329A (en) * 2018-07-27 2018-12-25 中山大学 Human body three-dimensional joint point estimation frame and its localization method based on depth network
CN110348371A (en) * 2019-07-08 2019-10-18 叠境数字科技(上海)有限公司 Human body three-dimensional acts extraction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DARIO PAVLLO等: "3D human pose estimation in video with temporal convolutions and semi-supervised training", 《ARXIV》 *
DARIO PAVLLO等: "3D human pose estimation in video with temporal convolutions and semi-supervised training", 《ARXIV》, 29 March 2019 (2019-03-29) *
吴誉兰 et al.: "Recognition and simulation of dynamic arm three-dimensional postures based on Kinect", 计算机仿真 (Computer Simulation), no. 07

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113916906A (en) * 2021-09-03 2022-01-11 江苏理工学院 LED light source illumination optimization method of visual detection system and used experimental equipment
CN113916906B (en) * 2021-09-03 2024-01-09 江苏理工学院 LED light source illumination optimization method of visual detection system and experimental equipment used by method
CN114972958A (en) * 2022-07-27 2022-08-30 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN115620094A (en) * 2022-12-19 2023-01-17 南昌虚拟现实研究院股份有限公司 Key point marking method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110910449B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110321754B (en) Human motion posture correction method and system based on computer vision
US9330470B2 (en) Method and system for modeling subjects from a depth map
CN105531995B (en) System and method for using multiple video cameras to carry out object and event recognition
CN110544301A (en) Three-dimensional human body action reconstruction system, method and action training system
CN110910449B (en) Method and system for identifying three-dimensional position of object
CN111881887A (en) Multi-camera-based motion attitude monitoring and guiding method and device
US20200090408A1 (en) Systems and methods for augmented reality body movement guidance and measurement
CN110264493A (en) A kind of multiple target object tracking method and device under motion state
CN113239797B (en) Human body action recognition method, device and system
CN112668549B (en) Pedestrian attitude analysis method, system, terminal and storage medium
JP2017097577A (en) Posture estimation method and posture estimation device
CN113658211B (en) User gesture evaluation method and device and processing equipment
CN110544302A (en) Human body action reconstruction system and method based on multi-view vision and action training system
KR20080018642A (en) Remote emergency monitoring system and method
ElSayed et al. Ambient and wearable sensing for gait classification in pervasive healthcare environments
WO2022127181A1 (en) Passenger flow monitoring method and apparatus, and electronic device and storage medium
KR101995411B1 (en) Device and method for making body model
CN114120168A (en) Target running distance measuring and calculating method, system, equipment and storage medium
KR20180094554A (en) Apparatus and method for reconstructing 3d image
CN115035546A (en) Three-dimensional human body posture detection method and device and electronic equipment
CN116805433B (en) Human motion trail data analysis system
CN117238031A (en) Motion capturing method and system for virtual person
WO2019137186A1 (en) Food identification method and apparatus, storage medium and computer device
US20200226787A1 (en) Information processing apparatus, information processing method, and program
US20220395193A1 (en) Height estimation apparatus, height estimation method, and non-transitory computer readable medium storing program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant