CN116503540A - Human body motion capturing, positioning and environment mapping method based on sparse sensor - Google Patents

Human body motion capturing, positioning and environment mapping method based on sparse sensor

Info

Publication number
CN116503540A
Authority
CN
China
Prior art keywords
camera
human body
pose
state data
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310484842.8A
Other languages
Chinese (zh)
Inventor
徐枫 (Feng Xu)
伊昕宇 (Xinyu Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310484842.8A priority Critical patent/CN116503540A/en
Publication of CN116503540A publication Critical patent/CN116503540A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C21/165 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05 Geographic models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/08 Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Automation & Control Theory (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body motion capturing, positioning and environment mapping method based on a sparse sensor, which comprises the following steps: acquiring inertial measurement values of IMU sensors and an image captured by a camera; solving for human body state data and camera state data based on the inertial measurement values; performing re-projection optimization on the camera state data and the camera image using reconstructed sparse map points, to obtain an optimized camera pose and confidence; and performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result. The invention realizes, for the first time, real-time simultaneous human body and environment perception based on sparse wearable sensors, and greatly improves positioning accuracy compared with the state of the art in both fields.

Description

Human body motion capturing, positioning and environment mapping method based on sparse sensor
Technical Field
The invention relates to the technical fields of computer graphics, computer vision, inertial sensing, human motion capture and scene reconstruction, and in particular to a human body motion capturing, positioning and environment mapping method based on a sparse sensor.
Background
Human body perception and environment perception are two important research problems in computer vision and graphics and are widely applied in fields such as human-computer interaction and virtual/augmented reality. Human motion is typically captured by inertial sensors, while the environment is mainly reconstructed using cameras.
Human body perception and environment perception have important research significance and application value in computer vision and graphics. First, human body perception techniques can capture human motions and actions, enabling applications such as motion capture, pose estimation and human-computer interaction. For example, in game development, human body perception techniques can be used to achieve realistic character actions, improving fidelity and immersion. In the medical field, they can be used for rehabilitation training and disease monitoring, among other applications, and they can further be applied to virtual reality, augmented reality and smart homes. Second, environment perception techniques can build scene models and enable applications such as three-dimensional reconstruction, scene analysis and intelligent navigation. For example, in autonomous driving, environment perception can be used for autonomous navigation and obstacle avoidance of a vehicle; in robotics, for autonomous exploration and operation of robots; and further in security monitoring, smart homes and virtual reality. Human body perception and environment perception are thus two indispensable and interdependent tasks; however, most existing techniques handle them independently. Inertial motion capture, on the one hand, is prone to large displacement drift due to the lack of three-dimensional spatial positioning signals; on the other hand, SLAM visual tracking often fails when visual features are poor.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention provides a human body motion capturing, positioning and environment mapping method based on a sparse sensor, which obtains human motion in real time, performs positioning and sparse map point reconstruction, and uses inertial motion capture to provide a strong prior for camera motion. It realizes, for the first time, real-time simultaneous human body and environment perception based on sparse wearable sensors, and greatly improves positioning accuracy compared with the state of the art in both fields.
Another object of the present invention is to provide a sparse sensor-based human motion capture, localization and environmental mapping system.
In order to achieve the above object, the present invention provides a human body motion capturing, positioning and environment mapping method based on a sparse sensor, comprising:
acquiring inertial measurement values of IMU sensors and an image captured by a camera;
solving for human body state data and camera state data based on the inertial measurement values;
performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
In addition, the sparse sensor-based human motion capturing, positioning and environment mapping method according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, after the acquiring the inertial measurement value of the IMU sensor and the capturing the image by the camera, the method further includes:
and carrying out coordinate system calibration of the inertial measurement value and time synchronization of the camera shooting image and the inertial measurement value based on the inertial measurement value of the IMU sensor and the camera shooting image, and obtaining a coordinate system calibration result and a time synchronization result.
Further, in one embodiment of the present invention, after the optimized camera pose and confidence are obtained, the method further includes:
and selecting a key frame according to the motion tracking state of the current frame based on a preset key frame selection scheme.
Further, in one embodiment of the present invention, using the camera state data as an initial camera pose comprises:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
Further, in one embodiment of the present invention, the method further comprises:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
In order to achieve the above object, another aspect of the present invention provides a human body motion capturing, positioning and environment mapping system based on a sparse sensor, comprising:
a data acquisition module for acquiring inertial measurement values of IMU sensors and an image captured by a camera;
a data solving module for solving for human body state data and camera state data based on the inertial measurement values;
a re-projection optimization module for performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and a correction, positioning and rendering module for performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
According to the human body motion capturing, positioning and environment mapping method and system based on the sparse sensor, a good balance is achieved between map point constraints and motion capture constraints, uncertainty in tracking is effectively reduced, and positioning accuracy is improved. The invention obtains a win-win result by fusing inertial motion capture with simultaneous localization and mapping.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a sparse sensor based human motion capture, localization and environmental mapping method in accordance with an embodiment of the present invention;
FIG. 2 is a frame diagram of a sparse sensor based human motion capture, localization, and environmental mapping method in accordance with an embodiment of the present invention;
FIG. 3 is a schematic illustration of beam adjustment optimization for motion capture constraints in accordance with an embodiment of the present invention;
FIG. 4 is an exemplary diagram of sparse point reconstruction of a virtual scene in accordance with an embodiment of the present invention;
FIG. 5 is a diagram showing an example of practical application according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a human motion capturing, positioning and environmental mapping system based on sparse sensors according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
The human body motion capturing, positioning and environment mapping method and system based on the sparse sensor according to the embodiment of the invention are described below with reference to the accompanying drawings.
The present invention recognizes that the combined perception of human motion and the environment is important for human interaction with the environment. First, simultaneous perception of the human body and the environment can improve the efficiency and safety of that interaction. For example, in an autonomous vehicle, simultaneously sensing the driver's behavior and the surrounding environment better ensures driving safety and smoothness. Second, simultaneous perception of the human body and the environment enables higher-level human-computer interaction. For example, in virtual reality and augmented reality, perceiving both the user's actions and the surrounding environment enables a more immersive experience. Simultaneous perception of the human body and the environment can therefore bring more efficient, safer and more intelligent human-computer interaction and environment-aware application experiences.
FIG. 1 is a flow chart of a sparse sensor based human motion capture, localization and environmental mapping method in accordance with an embodiment of the present invention.
As shown in fig. 1, the method includes, but is not limited to, the steps of:
s1, acquiring an inertial measurement value of an IMU sensor and a camera shooting image;
s2, solving to obtain human body state data and camera state data based on the inertial measurement value;
s3, performing re-projection optimization on the camera state data by using the reconstructed sparse map points and the camera shooting images to obtain optimized camera pose and confidence;
and S4, based on the human body state data and the optimized camera pose and confidence, performing human body position positioning and global motion correction to obtain a final human body pose and position, and performing real-time rendering on the reconstructed sparse map points and the final human body pose and position to obtain a visual rendering result.
It can be appreciated that the invention uses only 6 IMU sensors and 1 monocular color camera for real-time human motion capture, human positioning and environmental sparse point reconstruction. The 6 IMUs are worn on the lower arms, the lower legs, the head and the back of the human body, and the camera is fixed in front of the forehead, shooting outward. Each IMU sensor measures orientation and acceleration at 60 frames per second, and the camera takes pictures at 30 frames per second. The method comprises human motion capture, camera tracking, map reconstruction, closed-loop detection and human motion updating; the overall structure is shown in FIG. 2. The invention provides a deeply coupled framework to fully utilize the complementary advantages of sparse inertial motion capture and SLAM. In this framework, the human motion prior is combined with multiple key components of SLAM, and the positioning result of SLAM is also fed back to human motion capture. By jointly optimizing the camera pose and the sparse map point positions, combined with the information from human motion capture, the accuracy and robustness of tracking and map construction are improved. When visual features are reliable, SLAM can use the environmental information to correct the drift of inertial motion capture; when visual features are poor due to camera occlusion or extreme illumination, inertial motion capture provides pose and displacement estimates for the SLAM system, avoiding the complete failure seen in previous SLAM systems. In addition, the invention proposes a novel map point confidence that dynamically determines the importance of each map point in the system's key bundle adjustment algorithm. By reducing the influence of potentially erroneous map points, the method achieves a good balance between map point constraints and motion capture constraints, effectively reduces uncertainty in tracking, and improves positioning accuracy. The invention obtains a win-win result by fusing inertial motion capture with simultaneous localization and mapping.
As shown in FIG. 2, the human body motion capturing, positioning and environment mapping method based on the sparse sensor of the invention specifically comprises the following steps:
and step 1, acquiring an inertial measurement value of the IMU and a color picture shot by the camera, and performing coordinate system calibration of inertial data and time synchronization of images and the inertial data.
And 2, capturing human body motions, and solving initial human body gestures, motions and root node accelerations, as well as camera gestures and motions by utilizing inertial measurement values of the IMU.
And 3, using the camera tracking, taking the camera pose and the motion information obtained in the step 2 as initial camera pose information, carrying out re-projection optimization on the camera pose by utilizing the color image shot by the camera and the reconstructed sparse map points, obtaining the optimized camera pose and confidence, and selecting a proper key frame according to the motion tracking condition of each frame.
And 4, performing sparse map point reconstruction and closed loop detection optimization through the key frames selected in the step 3 by using map reconstruction and closed loop detection. A beam adjustment method (bundle adjustment) algorithm is operated in the module to simultaneously optimize map point positions and camera poses, human body kinematics priori knowledge and map point confidence are introduced into the algorithm, and the optimization process is dynamically constrained by utilizing the motion capturing result. And when the human body motion is closed-loop, performing pose diagram optimization of inertial motion capture constraint. And finally obtaining the optimized sparse map points and key frame pose.
And 5, updating human body movement, and positioning the human body position and correcting global movement by using the acceleration and the human body posture of the human body root node obtained in the step 2 and the optimized camera pose and confidence obtained in the step 3 to obtain the accurate human body position.
And 6, rendering the reconstructed sparse map points in the step 4 and the human body posture and position obtained in the step 5 in real time, and visualizing a final result.
The human body motion capturing, positioning and environment mapping method based on the sparse sensor in the embodiment of the invention is explained in detail below with reference to the accompanying drawings.
In one embodiment of the invention, the inertial measurement values of the IMUs and the color pictures captured by the camera are acquired, and coordinate system calibration of the inertial data is performed to eliminate the influence of the sensors being oriented differently on the human body and to align the data to a fixed global coordinate system for solving the human pose; the image and inertial data are time-synchronized so that image data can be matched with inertial data at the corresponding moments.
Specifically, the 6 IMUs acquire acceleration and orientation measurements at 60 Hz, and the camera acquires pictures at 30 Hz. To synchronize the time of the IMUs and the camera, the subject first wears all the sensors and performs a jumping motion. By detecting the peak of the acceleration measured by the IMUs and the abrupt-change moment in the camera images, the timestamps of the two are aligned and the measurements of the 6 IMUs are synchronized. After synchronization, the raw inertial measurements arrive at 60 Hz, and one camera picture is taken every other inertial frame. A T-pose sensor calibration is then performed, consistent with prior inertial motion capture work; upon completion, the IMU measurements can be aligned to the motion capture global coordinate system so that the pose can be solved with the methods of that work. Finally, the captured person is asked to walk along an arc, and by independently running a monocular simultaneous localization and mapping (SLAM) algorithm and a sparse inertial motion capture technique, two human motion trajectories are obtained, one in the SLAM global coordinate system and one in the motion capture global coordinate system. By aligning the two trajectories, the alignment relation between the SLAM and motion capture global coordinate systems and the map scale factor of the SLAM system are obtained. The subsequent steps use the two coordinate systems respectively; the related data can be transformed by the alignment matrix obtained in this step into the same coordinate system, which is not repeated later. The trajectory alignment can be implemented by an iterative closest point (ICP) algorithm with known correspondences.
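For illustration, an alignment with known correspondences admits a closed-form similarity solution. The following is a minimal sketch, assuming NumPy and two matched (N, 3) trajectories; the function name and interface are illustrative, and the closed form (Umeyama-style) stands in for the ICP-with-known-matches step described above:

```python
import numpy as np

def align_trajectories(slam_pts, mocap_pts):
    """Closed-form similarity alignment: find rotation R, translation t and
    scale s minimizing ||mocap - (s * R @ slam + t)||^2 over matched points."""
    mu_s, mu_m = slam_pts.mean(axis=0), mocap_pts.mean(axis=0)
    X, Y = slam_pts - mu_s, mocap_pts - mu_m      # centered trajectories, (N, 3)
    U, D, Vt = np.linalg.svd(Y.T @ X)             # SVD of the cross-covariance
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                 # enforce a proper rotation
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (X ** 2).sum() # SLAM map scale factor
    t = mu_m - s * R @ mu_s
    return R, t, s
```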
Further, human motion capture is used to solve for the initial human pose, motion and root-node acceleration, as well as the camera pose and motion, from the inertial measurement values of the IMUs. The sparse-inertial-sensor human motion capture is implemented with the PIP algorithm. Unlike the PIP algorithm, however, which assumes the scene is only flat ground, the invention removes this assumption, because the objective of the invention is to achieve free motion capture in 3D space. Accordingly, the invention removes the contact force computation and redesigns the dynamics optimizer in the PIP algorithm, because the dense geometry of the scene is unknown (reconstruction is performed only on sparse map points) and human-environment collisions therefore cannot be detected to apply forces at the contact points.
In one embodiment of the invention, a motion estimation algorithm consistent with PIP is used, i.e., a multi-stage recurrent neural network predicts the motion state of the human body; the dynamics optimization algorithm of PIP is modified so that it supports free human motion in 3D space. Following the notation of PIP, the invention defines the new dynamics optimization as:
$$\min_{\ddot{q}}\ \left\|\ddot{q}-\hat{\ddot{\theta}}\right\|^{2}+\left\|J\ddot{q}+\dot{J}\dot{q}-\hat{a}\right\|^{2}\quad\text{s.t. }C$$

wherein $\ddot{q}$ is the human body pose and motion acceleration, $\hat{\ddot{\theta}}$ and $\hat{a}$ are the target angular acceleration and target linear acceleration given by the dual PD controller in the PIP algorithm, $J$ is the human joint Jacobian matrix, $J\dot{q}$ is the joint linear velocity, and $C$ is a joint linear velocity constraint, described in detail below. The variable to be optimized is the body pose and motion acceleration $\ddot{q}$; solving for the optimal acceleration makes the motion of the physical human body conform to the optimal control scheme given by the dual PD controller. $C$ is defined as:

$$C:\ \left|\left(J_{j}\left(\dot{q}+\ddot{q}\,\Delta t\right)\right)^{x}\right|\le\sigma,\quad\left|\left(J_{j}\left(\dot{q}+\ddot{q}\,\Delta t\right)\right)^{z}\right|\le\sigma,\quad j=1,\dots,n_{j},$$

where $\sigma=10^{-3}$ is a sufficiently small speed limit, $n_{j}$ is the number of colliding joints in the algorithm, and the superscripts $x,z$ denote the $x$ and $z$ components of the three-dimensional vector. For specific details, refer to the PIP algorithm. Compared with PIP, the invention removes the limit on the vertical velocity, thereby enabling free motion in three-dimensional space. The solution method of this optimization and the complete human pose and motion estimation follow known techniques. After the human pose and motion are obtained, the camera pose is computed through a human forward kinematics algorithm and used in subsequent steps.
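For illustration only, the constrained acceleration solve above can be posed as a small quadratic program. Below is a minimal sketch using cvxpy; the interface is hypothetical, the $\dot{J}\dot{q}$ term is omitted for brevity, and `contact_rows` is assumed to index the x/z velocity components of colliding joints:

```python
import cvxpy as cp
import numpy as np

def solve_accelerations(J, qdot, theta_ddot_des, a_des, contact_rows,
                        dt=1.0 / 60.0, sigma=1e-3):
    """Sketch of the dynamics optimization: track the dual-PD acceleration
    targets while limiting the x/z (horizontal) velocity of colliding joints;
    the vertical component is left unconstrained, as described above."""
    qddot = cp.Variable(J.shape[1])
    cost = (cp.sum_squares(qddot - theta_ddot_des)   # angular-acceleration target
            + cp.sum_squares(J @ qddot - a_des))     # linear-acceleration target
    v_next = J @ (qdot + dt * qddot)                 # next-step joint velocity
    constraints = [cp.abs(v_next[contact_rows]) <= sigma]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return qddot.value
```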
Further, camera tracking is used: the camera pose and motion information obtained in step 2 are taken as the initial camera pose, and re-projection optimization of the camera pose is performed using the color image captured by the camera and the sparse map points reconstructed in step 4, obtaining the optimized camera pose and confidence, which are used to update the human body position in subsequent steps; suitable key frames are selected according to the motion tracking state of the current frame.
In one embodiment of the present invention, this step is designed based on ORB-SLAM3. ORB features are first extracted from the color image and matched against the sparse map points reconstructed in step 4 using feature similarity, yielding matched 2D-3D point pairs. Denote the world coordinates of a map point by $P_{i}$ and the pixel coordinates of its matched 2D image feature point by $p_{i}$, where $i\in\mathcal{X}$ ranges over all matches. Denote by $\tilde{R},\tilde{t}$ the initial camera pose before optimization; the camera pose $R,t$ is then optimized under the motion capture constraint:

$$\min_{R,t}\ \sum_{i\in\mathcal{X}}\rho\!\left(\left\|p_{i}-\pi\!\left(RP_{i}+t\right)\right\|_{\Sigma_{i}}^{2}\right)+\lambda_{R}\left\|\log\!\left(\tilde{R}^{T}R\right)\right\|^{2}+\lambda_{t}\left\|t-\tilde{t}\right\|^{2}$$

wherein $\rho(\cdot)$ is the robust Huber kernel; $\Sigma_{i}$ is a covariance matrix dependent on the feature point scale, i.e. on the image pyramid level at which the feature is detected; $\log:SO(3)\to\mathbb{R}^{3}$ maps a three-dimensional rotation in the Lie group to the three-dimensional vector space; $\pi(\cdot)$ is the projection operation of the pinhole camera model; and $\lambda_{R}$ and $\lambda_{t}$ are the control coefficients of the motion capture rotation and translation terms. The optimization refines the camera pose by minimizing the re-projection error from the matched 3D map points to the 2D image feature points while keeping the camera pose consistent with the pose constraint provided by human motion capture. The invention runs this optimization 3 times; after each pass, every match is classified as correct or erroneous according to its re-projection error, only correct matches are used in the next pass, and erroneous matches are deleted. Through the strong prior provided by the motion capture constraint, the algorithm distinguishes correct from incorrect matches better, improving camera tracking accuracy. The coefficients of the motion capture constraint terms are $\lambda_{R}=0.01^{2}$ and $\lambda_{t}=0.5^{2}f^{2}s^{2}$, where $f$ is the camera focal length and $s$ is the SLAM coordinate-system scale factor obtained in the calibration step. Scaling by the camera focal length and scale factor unifies the motion capture constraint error to pixel units consistent with the re-projection error, so the optimization is unaffected by the camera focal length or scale factor. The module solves this optimization with the Levenberg-Marquardt algorithm and accelerates the computation with the g2o graph optimization library to guarantee real-time performance. After the camera pose is solved, the module extracts the number $n$ of correctly matched map points and uses it as the confidence of the pose.
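For illustration, this constrained pose refinement can be prototyped with SciPy's least_squares. This is a minimal sketch under simplifying assumptions: axis-angle parameterization, the Huber loss applied to all residuals (the embodiment applies it to the re-projection term only), the scale-dependent covariance $\Sigma_i$ omitted, and the coefficient arguments taken as the square roots of $\lambda_R$ and $\lambda_t$ (e.g. lam_t = 0.5 f s, with f = 500 and s = 1 as placeholder values):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_camera_pose(P_w, p_px, K, R0, t0, lam_R=0.01, lam_t=250.0):
    """Minimize Huber reprojection error of 3D map points P_w (N,3) against
    2D pixels p_px (N,2) while keeping (R, t) near the mocap pose (R0, t0)."""
    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        t = x[3:]
        Pc = P_w @ R.T + t                     # map points in the camera frame
        uv = Pc @ K.T
        uv = uv[:, :2] / uv[:, 2:3]            # pinhole projection pi(.)
        r_rep = (uv - p_px).ravel()            # reprojection residuals (pixels)
        r_rot = lam_R * Rotation.from_matrix(R0.T @ R).as_rotvec()  # log(R0^T R)
        r_trn = lam_t * (t - t0)               # translation prior, pixel units
        return np.concatenate([r_rep, r_rot, r_trn])

    x0 = np.concatenate([Rotation.from_matrix(R0).as_rotvec(), t0])
    sol = least_squares(residuals, x0, loss='huber', f_scale=1.0)
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
```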
It will be appreciated that, in practice, the absolute camera position obtained by pure inertial motion capture typically suffers from substantial drift due to the accumulation of inertial errors. Using it directly to initialize camera tracking optimization is therefore not feasible for longer sequences, since such nonlinear optimization usually requires a good initialization. The invention therefore performs a camera pose alignment step for each frame to eliminate the drift before optimization.
Specifically, the invention computes the relative rotation and displacement of the camera from motion capture and applies them on top of the camera pose previously optimized by SLAM. Denoting the orientation and position of the camera by $R,t$ and the camera pose extracted from motion capture by $\hat{R},\hat{t}$, the camera pose alignment operation is:

$$R_{cur}=R_{last}\,\hat{R}_{last}^{T}\,\hat{R}_{cur},\qquad t_{cur}=t_{last}+\left(\hat{t}_{cur}-\hat{t}_{last}\right),$$

where the subscript $cur$ denotes the current frame and the subscript $last$ denotes the previous frame.
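A minimal sketch of this alignment (NumPy rotation matrices; the composition order follows the equation above, which is itself a reconstruction):

```python
def align_camera_pose(R_last, t_last, Rm_last, tm_last, Rm_cur, tm_cur):
    """Apply the mocap relative motion on top of the last SLAM-optimized pose."""
    R_cur = R_last @ Rm_last.T @ Rm_cur        # relative rotation from mocap
    t_cur = t_last + (tm_cur - tm_last)        # relative translation from mocap
    return R_cur, t_cur
```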
At the end of this step, the invention selects key frames for the subsequent map reconstruction and closed-loop detection steps. The key frame selection scheme follows known practice, but the invention additionally stores the initial camera pose obtained by inertial motion capture within each key frame, for use by the subsequent bundle adjustment and closed-loop detection algorithms.
Further, map reconstruction and closed-loop detection are used: sparse map point reconstruction and closed-loop detection optimization are performed through the key frames selected in step 3. In this step, a bundle adjustment (BA) algorithm runs to optimize the sparse map point positions and camera poses simultaneously; the human kinematic prior and the map point confidence introduced below are incorporated into the algorithm, dynamically constraining the optimization with the human pose and motion estimated in step 2. When the human motion forms a closed loop, inertial-motion-capture-assisted pose graph optimization is performed. Finally, the optimized sparse map point positions and key frame poses are obtained, and the BA and closed-loop detection algorithms run for the next frame. The sparse map point positions and key frame poses are maintained and updated by the system.
Specifically, this embodiment first assigns a confidence value to every map point that participates in the BA optimization. The confidence $c_{i}$ of the $i$-th map point is computed as:

$$c_{i}=K\,b_{i}\,\theta_{i},$$

where $b_{i}$ is the largest distance between the keyframes observing the map point, analogous to the baseline length in three-dimensional reconstruction; $\theta_{i}$ is the maximum angle between the viewing directions under which the map point is observed; and $K$ is a hyperparameter, chosen as 50, such that the average confidence approaches 1. The intuition is that when the keyframes observing a map point span a sufficiently large range and the viewing angle is sufficiently large, the position of the map point is considered more accurate. This confidence is used in the subsequent BA optimization step.
Then, the motion-capture-constrained bundle adjustment optimization is performed. As shown in FIG. 3, the map point positions and the camera poses of the last 20 keyframes are optimized simultaneously, while the other keyframes that observe these map points are kept fixed in the optimization. Denote the set of all optimizable keyframes by $K_{o}$, the set of all fixed keyframes by $K_{f}$, and the set of map points observed by keyframe $j$ by $X_{j}$. Let $R_{o}=\{R_{j}\mid j\in K_{o}\}$ and $T_{o}=\{t_{j}\mid j\in K_{o}\}$ be the keyframe orientations and three-dimensional positions to be optimized, and let $P_{i}$ denote the position of map point $i$ in the world. The motion-capture-constrained bundle adjustment is defined as:

$$\min_{R_{o},T_{o},\{P_{i}\}}\ \sum_{j\in K_{o}\cup K_{f}}\sum_{i\in X_{j}}c_{i}\,\rho\!\left(\left\|p_{ij}-\pi\!\left(R_{j}P_{i}+t_{j}\right)\right\|_{\Sigma_{ij}}^{2}\right)+\sum_{j\in K_{o}}\left(\mu_{R}\left\|\log\!\left(\hat{R}_{j}^{T}R_{j}\right)\right\|^{2}+\mu_{t}\left\|\left(t_{j}-t_{prev(j)}\right)-\left(\hat{t}_{j}-\hat{t}_{prev(j)}\right)\right\|^{2}\right)$$

where $prev(j)$ denotes the keyframe preceding keyframe $j$, and $\mu_{R}=0.01^{2}$ and $\mu_{t}=0.05^{2}s^{2}$ are the coefficients of the motion capture constraint, with units aligned to the pixel scale consistent with the re-projection error. The optimization requires the re-projection errors of the map points to be small while the rotation and the relative position of each keyframe remain similar to the motion capture result. By exploiting the human motion prior provided by motion capture, more accurate simultaneous optimization of map points and camera poses is achieved; the map point confidence $c_{i}$ dynamically determines the relative weight between the motion capture constraint terms and the map point re-projection terms. The solution algorithm is consistent with step 3, with the map point positions marginalized to accelerate the optimization.
When a trajectory closed loop is detected (i.e., the human body returns to a previously visited position), closed-loop optimization is required. The invention designs this closed-loop optimization by modifying the pose graph optimization algorithm so that it integrates the camera pose prior provided by motion capture. The vertex set of the pose graph is F and the edge set is C.
In one embodiment of the present invention, the motion-capture-constrained pose graph optimization proposed by the invention is defined as:

$$\min_{\{T_{j}\mid j\in F\}}\ \sum_{(i,j)\in C}\left\|\log\!\left(T_{ij}^{-1}T_{i}^{-1}T_{j}\right)\right\|^{2}+\omega_{pose}\sum_{j\in F}\left\|\log\!\left(\hat{T}_{j}^{-1}T_{j}\right)\right\|^{2}$$

wherein $T_{j}\in SE(3)$ is the pose of keyframe $j$; $T_{ij}\in SE(3)$ is the relative pose between keyframes $i$ and $j$ before the pose graph optimization; $\hat{T}_{j}$ is the initial camera pose value obtained from motion capture; $\log:SE(3)\to\mathbb{R}^{6}$ maps a pose to the six-dimensional vector space; and $\omega_{pose}$, the relative coefficient of the motion capture constraint, is taken as 0.2.
Further, the human motion update is used to locate the human body position and correct the global motion using the root-node acceleration and human pose obtained in step 2 and the optimized camera pose and confidence obtained in step 3, obtaining an accurate human body position. The camera tracking module takes both visual and inertial information into account, so the camera position it outputs at 30 frames per second can fine-tune the 60-frames-per-second human motion from the inertial motion capture module. The invention implements this module with the prediction-correction algorithm of a Kalman filter.
Specifically, denoting the global position and velocity of the human body by $p$ and $v$, and the global acceleration extracted from the motion capture module by $a$, the following state transition equations can be defined:
$$p_{k}=p_{k-1}+v_{k-1}\,\Delta t$$
$$v_{k}=v_{k-1}+a_{k-1}\,\Delta t+q_{k-1},$$
wherein the subscript $k$ denotes the $k$-th frame, $\Delta t=1/60$ is the time interval between frames, and $q\sim N(0,\sigma^{2}I)$ models the motion capture prediction error; the invention sets $\sigma=\Delta t$, which amounts to assuming that the variance of the acceleration estimated by the network is 1. Denoting the optimized camera position from step 3 by $p_{cam}$, with confidence $n$, the 30 Hz state observation equation is defined as:

$$p_{cam}=p_{k}+\Delta p_{k}+r,$$

where $\Delta p_{k}$ is the three-dimensional position difference between the camera and the root node, computed by the human forward kinematics algorithm in motion capture, and $r\sim N(0,\Sigma_{cam})$ models the error of camera tracking, with the covariance matrix $\Sigma_{cam}$ approximated as the diagonal matrix:

$$\Sigma_{cam}=\frac{1}{n+\epsilon}\,I,$$

where $\epsilon=10^{-3}$ avoids division by zero and $n$ is the camera pose confidence computed in step 3. Based on the state transition equation and the state observation equation, the invention predicts the global position of the human body with a Kalman filtering algorithm as the output.
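For illustration, a minimal NumPy sketch of the predict-correct loop described above, with state $x=[p,v]\in\mathbb{R}^{6}$; the interface is hypothetical:

```python
import numpy as np

def kf_predict(p, v, P, a, dt=1.0 / 60.0):
    """60 Hz prediction with mocap acceleration a; q ~ N(0, sigma^2 I), sigma = dt."""
    F = np.block([[np.eye(3), dt * np.eye(3)],
                  [np.zeros((3, 3)), np.eye(3)]])
    x = F @ np.concatenate([p, v]) + np.concatenate([np.zeros(3), a * dt])
    Q = np.zeros((6, 6))
    Q[3:, 3:] = (dt ** 2) * np.eye(3)            # process noise on the velocity
    return x[:3], x[3:], F @ P @ F.T + Q

def kf_correct(p, v, P, p_cam, dp, n, eps=1e-3):
    """30 Hz correction with the SLAM camera position p_cam; dp is the
    camera-to-root offset from forward kinematics, n the pose confidence."""
    H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe the position only
    Rm = np.eye(3) / (n + eps)                    # Sigma_cam, diagonal
    x = np.concatenate([p, v])
    y = (p_cam - dp) - H @ x                      # innovation
    S = H @ P @ H.T + Rm
    Kg = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    x = x + Kg @ y
    P = (np.eye(6) - Kg @ H) @ P
    return x[:3], x[3:], P
```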
Further, the sparse map points reconstructed in step 4 and the human pose and position obtained in step 5 are rendered in real time, and the final result is visualized. FIG. 4 is an example of the invention running in a virtual scene from a public inertial motion capture dataset, with the reconstructed sparse map points rendered. FIG. 5 shows a practical operation example: the left side shows the motion of the actual human body, and the right side shows the human pose, the human motion trajectory and the reconstructed sparse map points estimated by the system.
In summary, the invention builds on the fact that human motion is usually captured by inertial sensors while the environment is mainly reconstructed using cameras; it integrates the two technologies and develops a method that performs human motion capture, positioning and environment mapping simultaneously, obtaining human motion in real time while performing positioning and sparse map point reconstruction.
In summary, the present invention provides, for the first time, a technique for simultaneous real-time human motion capture, positioning and environment mapping using only 6 inertial sensors (IMUs) and 1 monocular color camera. Inertial motion capture (mocap) exploits "internal" information such as human motion signals and motion priors, while simultaneous localization and mapping (SLAM) relies mainly on "external" information, i.e., the environment captured by the camera. The former is stable, but its global position drifts over long motions because there is no external reference for correction; the latter can estimate the global position in the scene with high accuracy, but easily loses tracking when the environmental information is unreliable (e.g., texture-less regions or occlusion). The invention therefore effectively combines these two complementary techniques (mocap and SLAM), fusing the human motion prior and visual tracking across several key algorithms to achieve stable and accurate human positioning and map reconstruction. The 6 IMUs are worn on the person's limbs, head and back, and the monocular color camera is fixed on the head, shooting outward. This design is inspired by real human behavior: when humans are in a new environment, they acquire scene information with their eyes and plan their actions in the scene accordingly. The monocular camera acts as the human's eye, providing visual signals for real-time scene reconstruction and self-positioning, while the IMUs measure the motion of the body's most important parts, i.e., the limbs and head, in the scene. The overall system realizes, for the first time, simultaneous human motion capture and sparse environment reconstruction based on only 6 IMUs and 1 camera; it runs at 60 fps on a CPU and exceeds the state of the art in both fields in accuracy.
According to the human body motion capturing, positioning and environment mapping method based on the sparse sensor of the embodiments of the present invention, human motion is obtained in real time, positioning and sparse map point reconstruction are performed, real-time simultaneous human body and environment perception based on sparse wearable sensors is realized for the first time, and positioning accuracy is greatly improved compared with the state of the art in both fields.
To implement the above embodiments, as shown in FIG. 6, this embodiment further provides a human body motion capturing, positioning and environment mapping system 10 based on a sparse sensor, where the system 10 includes a data acquisition module 100, a data solving module 200, a re-projection optimization module 300 and a correction, positioning and rendering module 400.
The data acquisition module 100 is used for acquiring inertial measurement values of the IMU sensors and images captured by the camera;
the data solving module 200 is used for solving for human body state data and camera state data based on the inertial measurement values;
the re-projection optimization module 300 is used for performing re-projection optimization on the camera state data using the reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and the correction, positioning and rendering module 400 is used for performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
Further, the system further includes, after the data acquisition module 100:
a data synchronization module for performing coordinate system calibration of the inertial measurement values and time synchronization of the camera images with the inertial measurement values, based on the inertial measurement values of the IMU sensors and the images captured by the camera, to obtain a coordinate system calibration result and a time synchronization result.
Further, the system further includes a key frame selection module after the re-projection optimization module 300,
the key frame selection module being used for selecting key frames according to the motion tracking state of the current frame, based on a preset key frame selection scheme.
Further, the re-projection optimization module 300 is further configured to use the camera state data as an initial camera pose, including:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
Further, the system further includes, after the key frame selection module, a map reconstruction and closed-loop detection module for:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
According to the human body motion capturing, positioning and environment mapping system based on the sparse sensor of the embodiments of the present invention, human motion is obtained in real time, positioning and sparse map point reconstruction are performed, real-time simultaneous human body and environment perception based on sparse wearable sensors is realized for the first time, and positioning accuracy is greatly improved compared with the state of the art in both fields.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

Claims (10)

1. A human body motion capturing, positioning and environment mapping method based on a sparse sensor, characterized by comprising the following steps:
acquiring inertial measurement values of IMU sensors and an image captured by a camera;
solving for human body state data and camera state data based on the inertial measurement values;
performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
2. The method of claim 1, wherein after the acquiring of the inertial measurement values of the IMU sensors and the image captured by the camera, the method further comprises:
performing coordinate system calibration of the inertial measurement values and time synchronization of the camera image with the inertial measurement values, based on the inertial measurement values of the IMU sensors and the image captured by the camera, to obtain a coordinate system calibration result and a time synchronization result.
3. The method of claim 1, wherein after the optimized camera pose and confidence are obtained, the method further comprises:
selecting key frames according to the motion tracking state of the current frame, based on a preset key frame selection scheme.
4. The method of claim 1, wherein using the camera state data as an initial camera pose comprises:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
5. The method of claim 3, further comprising:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
6. A human body motion capturing, positioning and environment mapping system based on a sparse sensor, characterized by comprising:
a data acquisition module for acquiring inertial measurement values of IMU sensors and an image captured by a camera;
a data solving module for solving for human body state data and camera state data based on the inertial measurement values;
a re-projection optimization module for performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and a correction, positioning and rendering module for performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
7. The system of claim 6, further comprising, after the data acquisition module:
a data synchronization module for performing coordinate system calibration of the inertial measurement values and time synchronization of the camera image with the inertial measurement values, based on the inertial measurement values of the IMU sensors and the image captured by the camera, to obtain a coordinate system calibration result and a time synchronization result.
8. The system of claim 6, further comprising a key frame selection module after the re-projection optimization module,
the key frame selection module being used for selecting key frames according to the motion tracking state of the current frame, based on a preset key frame selection scheme.
9. The system of claim 6, wherein the re-projection optimization module is further configured to use the camera state data as an initial camera pose, including:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
10. The system of claim 8, further comprising, after the key frame selection module, a map reconstruction and closed-loop detection module for:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
CN202310484842.8A 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor Pending CN116503540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310484842.8A CN116503540A (en) 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310484842.8A CN116503540A (en) 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor

Publications (1)

Publication Number Publication Date
CN116503540A 2023-07-28

Family

ID=87326214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310484842.8A Pending CN116503540A (en) 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor

Country Status (1)

Country Link
CN (1) CN116503540A (en)

Similar Documents

Publication Publication Date Title
CN109307508B (en) Panoramic inertial navigation SLAM method based on multiple key frames
Qin et al. Vins-mono: A robust and versatile monocular visual-inertial state estimator
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN110125928B (en) Binocular inertial navigation SLAM system for performing feature matching based on front and rear frames
CN109993113B (en) Pose estimation method based on RGB-D and IMU information fusion
WO2019157925A1 (en) Visual-inertial odometry implementation method and system
US10852847B2 (en) Controller tracking for multiple degrees of freedom
CN111258313A (en) Multi-sensor fusion SLAM system and robot
US20130335529A1 (en) Camera pose estimation apparatus and method for augmented reality imaging
Alcantarilla et al. Visual odometry priors for robust EKF-SLAM
Gemeiner et al. Simultaneous motion and structure estimation by fusion of inertial and vision data
Saini et al. Markerless outdoor human motion capture using multiple autonomous micro aerial vehicles
CN110726406A (en) Improved nonlinear optimization monocular inertial navigation SLAM method
CN111932674A (en) Optimization method of line laser vision inertial system
Shamwell et al. Vision-aided absolute trajectory estimation using an unsupervised deep network with online error correction
CN116222543B (en) Multi-sensor fusion map construction method and system for robot environment perception
CN109242887A (en) A kind of real-time body's upper limks movements method for catching based on multiple-camera and IMU
CN111353355A (en) Motion tracking system and method
CN114485640A (en) Monocular vision inertia synchronous positioning and mapping method and system based on point-line characteristics
WO2024094227A1 (en) Gesture pose estimation method based on kalman filtering and deep learning
CN111489392B (en) Single target human motion posture capturing method and system in multi-person environment
Huai et al. Real-time large scale 3D reconstruction by fusing Kinect and IMU data
TW202314593A (en) Positioning method and equipment, computer-readable storage medium
KR102456872B1 (en) System and method for tracking hand motion using strong coupling fusion of image sensor and inertial sensor
CN112907633A (en) Dynamic characteristic point identification method and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination