CN116503540A - Human body motion capturing, positioning and environment mapping method based on sparse sensor - Google Patents

Human body motion capturing, positioning and environment mapping method based on sparse sensor

Info

Publication number
CN116503540A
Authority
CN
China
Prior art keywords
camera
human body
pose
state data
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310484842.8A
Other languages
Chinese (zh)
Inventor
徐枫 (Feng Xu)
伊昕宇 (Xinyu Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310484842.8A priority Critical patent/CN116503540A/en
Publication of CN116503540A publication Critical patent/CN116503540A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C21/165 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05 Geographic models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/08 Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Automation & Control Theory (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body motion capturing, positioning and environment mapping method based on a sparse sensor, which comprises the following steps: acquiring inertial measurement values of IMU sensors and an image captured by a camera; solving for human body state data and camera state data based on the inertial measurement values; performing re-projection optimization on the camera state data and the camera image using reconstructed sparse map points, to obtain an optimized camera pose and confidence; and performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result. The invention realizes, for the first time, real-time simultaneous human body and environment perception based on sparse wearable sensors, and greatly improves positioning accuracy compared with the state of the art in both fields.

Description

Human body motion capturing, positioning and environment mapping method based on sparse sensor
Technical Field
The invention relates to the technical fields of computer graphics, computer vision, inertial sensing, human motion capture and scene reconstruction, and in particular to a human body motion capturing, positioning and environment mapping method based on a sparse sensor.
Background
Human body perception and environment perception are two important research problems in computer vision and graphics and are widely applied in fields such as human-computer interaction and virtual/augmented reality. Human motion is typically captured by inertial sensors, while the environment is mainly reconstructed using cameras.
Human body perception and environment perception have important research significance and application value in computer vision and graphics. First, human body perception techniques can capture human motions and actions, enabling applications such as motion capture, pose estimation and human-computer interaction. For example, in game development, human body perception techniques can be used to achieve realistic character actions, improving fidelity and immersion. In the medical field, they can be used for rehabilitation training and disease monitoring, among other applications, and they can further be applied to virtual reality, augmented reality and smart homes. Second, environment perception techniques can build scene models and enable applications such as three-dimensional reconstruction, scene analysis and intelligent navigation. For example, in autonomous driving, environment perception can be used for autonomous navigation and obstacle avoidance of a vehicle; in robotics, for autonomous exploration and operation of robots; and further in security monitoring, smart homes and virtual reality. Human body perception and environment perception are thus two indispensable and interdependent tasks; however, most existing techniques handle them independently. Inertial motion capture, on the one hand, is prone to large displacement drift due to the lack of three-dimensional spatial positioning signals; on the other hand, SLAM visual tracking often fails when visual features are poor.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention provides a human body motion capturing, positioning and environment mapping method based on a sparse sensor, which obtains human motion in real time, performs positioning and sparse map point reconstruction, and uses inertial motion capture to provide a strong prior for camera motion. It realizes, for the first time, real-time simultaneous human body and environment perception based on sparse wearable sensors, and greatly improves positioning accuracy compared with the state of the art in both fields.
Another object of the present invention is to provide a sparse sensor-based human motion capture, localization and environmental mapping system.
In order to achieve the above object, the present invention provides a human body motion capturing, positioning and environment mapping method based on a sparse sensor, comprising:
acquiring inertial measurement values of IMU sensors and an image captured by a camera;
solving for human body state data and camera state data based on the inertial measurement values;
performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
In addition, the sparse sensor-based human motion capturing, positioning and environment mapping method according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, after the acquiring the inertial measurement value of the IMU sensor and the capturing the image by the camera, the method further includes:
and carrying out coordinate system calibration of the inertial measurement value and time synchronization of the camera shooting image and the inertial measurement value based on the inertial measurement value of the IMU sensor and the camera shooting image, and obtaining a coordinate system calibration result and a time synchronization result.
Further, in one embodiment of the present invention, after the optimized camera pose and confidence are obtained, the method further includes:
and selecting a key frame according to the motion tracking state of the current frame based on a preset key frame selection scheme.
Further, in one embodiment of the present invention, using the camera state data as an initial camera pose comprises:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
Further, in one embodiment of the present invention, the method further comprises:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
In order to achieve the above object, another aspect of the present invention provides a human body motion capturing, positioning and environment mapping system based on a sparse sensor, comprising:
a data acquisition module for acquiring inertial measurement values of IMU sensors and an image captured by a camera;
a data solving module for solving for human body state data and camera state data based on the inertial measurement values;
a re-projection optimization module for performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and a correction, positioning and rendering module for performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
According to the human body motion capturing, positioning and environment mapping method and system based on the sparse sensor, a good balance is achieved between map point constraints and motion capture constraints, uncertainty in tracking is effectively reduced, and positioning accuracy is improved. The invention obtains a win-win result by fusing inertial motion capture with simultaneous localization and mapping.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a sparse sensor based human motion capture, localization and environmental mapping method in accordance with an embodiment of the present invention;
FIG. 2 is a frame diagram of a sparse sensor based human motion capture, localization, and environmental mapping method in accordance with an embodiment of the present invention;
FIG. 3 is a schematic illustration of beam adjustment optimization for motion capture constraints in accordance with an embodiment of the present invention;
FIG. 4 is an exemplary diagram of sparse point reconstruction of a virtual scene in accordance with an embodiment of the present invention;
FIG. 5 is a diagram showing an example of practical application according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a human motion capturing, positioning and environmental mapping system based on sparse sensors according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
The human body motion capturing, positioning and environment mapping method and system based on the sparse sensor according to the embodiment of the invention are described below with reference to the accompanying drawings.
The present invention recognizes that the combined perception of human motion and the environment is important for human interaction with the environment. First, simultaneous perception of the human body and the environment can improve the efficiency and safety of that interaction. For example, in an autonomous vehicle, simultaneously sensing the driver's behavior and the surrounding environment better ensures driving safety and smoothness. Second, simultaneous perception of the human body and the environment enables higher-level human-computer interaction. For example, in virtual reality and augmented reality, perceiving both the user's actions and the surrounding environment enables a more immersive experience. Simultaneous perception of the human body and the environment can therefore bring more efficient, safer and more intelligent human-computer interaction and environment-aware application experiences.
FIG. 1 is a flow chart of a sparse sensor based human motion capture, localization and environmental mapping method in accordance with an embodiment of the present invention.
As shown in fig. 1, the method includes, but is not limited to, the steps of:
s1, acquiring an inertial measurement value of an IMU sensor and a camera shooting image;
s2, solving to obtain human body state data and camera state data based on the inertial measurement value;
s3, performing re-projection optimization on the camera state data by using the reconstructed sparse map points and the camera shooting images to obtain optimized camera pose and confidence;
and S4, based on the human body state data and the optimized camera pose and confidence, performing human body position positioning and global motion correction to obtain a final human body pose and position, and performing real-time rendering on the reconstructed sparse map points and the final human body pose and position to obtain a visual rendering result.
It can be appreciated that the invention uses only 6 IMU sensors and 1 monocular color camera for real-time human motion capture, human positioning and environmental sparse point reconstruction. The 6 IMUs are worn on the lower arms, the lower legs, the head and the back of the human body, and the camera is fixed in front of the forehead, shooting outward. Each IMU sensor measures orientation and acceleration at 60 frames per second, and the camera takes pictures at 30 frames per second. The method comprises human motion capture, camera tracking, map reconstruction, closed-loop detection and human motion updating; the overall structure is shown in FIG. 2. The invention provides a deeply coupled framework to fully utilize the complementary advantages of sparse inertial motion capture and SLAM. In this framework, the human motion prior is combined with multiple key components of SLAM, and the positioning result of SLAM is also fed back to human motion capture. By jointly optimizing the camera pose and the sparse map point positions, combined with the information from human motion capture, the accuracy and robustness of tracking and map construction are improved. When visual features are reliable, SLAM can use the environmental information to correct the drift of inertial motion capture; when visual features are poor due to camera occlusion or extreme illumination, inertial motion capture provides pose and displacement estimates for the SLAM system, avoiding the complete failure seen in previous SLAM systems. In addition, the invention proposes a novel map point confidence that dynamically determines the importance of each map point in the system's key bundle adjustment algorithm. By reducing the influence of potentially erroneous map points, the method achieves a good balance between map point constraints and motion capture constraints, effectively reduces uncertainty in tracking, and improves positioning accuracy. The invention obtains a win-win result by fusing inertial motion capture with simultaneous localization and mapping.
As shown in FIG. 2, the human body motion capturing, positioning and environment mapping method based on the sparse sensor of the invention specifically comprises the following steps:
and step 1, acquiring an inertial measurement value of the IMU and a color picture shot by the camera, and performing coordinate system calibration of inertial data and time synchronization of images and the inertial data.
And 2, capturing human body motions, and solving initial human body gestures, motions and root node accelerations, as well as camera gestures and motions by utilizing inertial measurement values of the IMU.
And 3, using the camera tracking, taking the camera pose and the motion information obtained in the step 2 as initial camera pose information, carrying out re-projection optimization on the camera pose by utilizing the color image shot by the camera and the reconstructed sparse map points, obtaining the optimized camera pose and confidence, and selecting a proper key frame according to the motion tracking condition of each frame.
And 4, performing sparse map point reconstruction and closed loop detection optimization through the key frames selected in the step 3 by using map reconstruction and closed loop detection. A beam adjustment method (bundle adjustment) algorithm is operated in the module to simultaneously optimize map point positions and camera poses, human body kinematics priori knowledge and map point confidence are introduced into the algorithm, and the optimization process is dynamically constrained by utilizing the motion capturing result. And when the human body motion is closed-loop, performing pose diagram optimization of inertial motion capture constraint. And finally obtaining the optimized sparse map points and key frame pose.
And 5, updating human body movement, and positioning the human body position and correcting global movement by using the acceleration and the human body posture of the human body root node obtained in the step 2 and the optimized camera pose and confidence obtained in the step 3 to obtain the accurate human body position.
And 6, rendering the reconstructed sparse map points in the step 4 and the human body posture and position obtained in the step 5 in real time, and visualizing a final result.
The human body motion capturing, positioning and environment mapping method based on the sparse sensor in the embodiment of the invention is explained in detail below with reference to the accompanying drawings.
In one embodiment of the invention, the inertial measurement values of the IMUs and the color pictures captured by the camera are acquired, and coordinate system calibration of the inertial data is performed to eliminate the influence of the sensors being oriented differently on the human body and to align the data to a fixed global coordinate system for solving the human pose; the image and inertial data are time-synchronized so that image data can be matched with inertial data at the corresponding moments.
Specifically, the 6 IMUs acquire acceleration and orientation measurements at 60 Hz, and the camera acquires pictures at 30 Hz. To synchronize the time of the IMUs and the camera, the subject first wears all the sensors and performs a jumping motion. By detecting the peak of the acceleration measured by the IMUs and the abrupt-change moment in the camera images, the timestamps of the two are aligned and the measurements of the 6 IMUs are synchronized. After synchronization, the raw inertial measurements arrive at 60 Hz, and one camera picture is taken every other inertial frame. A T-pose sensor calibration is then performed, consistent with prior inertial motion capture work; upon completion, the IMU measurements can be aligned to the motion capture global coordinate system so that the pose can be solved with the methods of that work. Finally, the captured person is asked to walk along an arc, and by independently running a monocular simultaneous localization and mapping (SLAM) algorithm and a sparse inertial motion capture technique, two human motion trajectories are obtained, one in the SLAM global coordinate system and one in the motion capture global coordinate system. By aligning the two trajectories, the alignment relation between the SLAM and motion capture global coordinate systems and the map scale factor of the SLAM system are obtained. The subsequent steps use the two coordinate systems respectively; the related data can be transformed by the alignment matrix obtained in this step into the same coordinate system, which is not repeated later. The trajectory alignment can be implemented by an iterative closest point (ICP) algorithm with known correspondences.
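For illustration, an alignment with known correspondences admits a closed-form similarity solution. The following is a minimal sketch, assuming NumPy and two matched (N, 3) trajectories; the function name and interface are illustrative, and the closed form (Umeyama-style) stands in for the ICP-with-known-matches step described above:

```python
import numpy as np

def align_trajectories(slam_pts, mocap_pts):
    """Closed-form similarity alignment: find rotation R, translation t and
    scale s minimizing ||mocap - (s * R @ slam + t)||^2 over matched points."""
    mu_s, mu_m = slam_pts.mean(axis=0), mocap_pts.mean(axis=0)
    X, Y = slam_pts - mu_s, mocap_pts - mu_m      # centered trajectories, (N, 3)
    U, D, Vt = np.linalg.svd(Y.T @ X)             # SVD of the cross-covariance
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                 # enforce a proper rotation
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (X ** 2).sum() # SLAM map scale factor
    t = mu_m - s * R @ mu_s
    return R, t, s
```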
Further, human motion capture is used to solve for the initial human pose, motion and root-node acceleration, as well as the camera pose and motion, from the inertial measurement values of the IMUs. The sparse-inertial-sensor human motion capture is implemented with the PIP algorithm. Unlike the PIP algorithm, however, which assumes the scene is only flat ground, the invention removes this assumption, because the objective of the invention is to achieve free motion capture in 3D space. Accordingly, the invention removes the contact force computation and redesigns the dynamics optimizer in the PIP algorithm, because the dense geometry of the scene is unknown (reconstruction is performed only on sparse map points) and human-environment collisions therefore cannot be detected to apply forces at the contact points.
In one embodiment of the invention, a motion estimation algorithm consistent with PIP is used, i.e., a multi-stage recurrent neural network predicts the motion state of the human body; the dynamics optimization algorithm of PIP is modified so that it supports free human motion in 3D space. Following the notation of PIP, the invention defines the new dynamics optimization as:
$$\min_{\ddot{q}}\ \left\|\ddot{q}-\hat{\ddot{\theta}}\right\|^{2}+\left\|J\ddot{q}+\dot{J}\dot{q}-\hat{a}\right\|^{2}\quad\text{s.t. }C$$

wherein $\ddot{q}$ is the human body pose and motion acceleration, $\hat{\ddot{\theta}}$ and $\hat{a}$ are the target angular acceleration and target linear acceleration given by the dual PD controller in the PIP algorithm, $J$ is the human joint Jacobian matrix, $J\dot{q}$ is the joint linear velocity, and $C$ is a joint linear velocity constraint, described in detail below. The variable to be optimized is the body pose and motion acceleration $\ddot{q}$; solving for the optimal acceleration makes the motion of the physical human body conform to the optimal control scheme given by the dual PD controller. $C$ is defined as:

$$C:\ \left|\left(J_{j}\left(\dot{q}+\ddot{q}\,\Delta t\right)\right)^{x}\right|\le\sigma,\quad\left|\left(J_{j}\left(\dot{q}+\ddot{q}\,\Delta t\right)\right)^{z}\right|\le\sigma,\quad j=1,\dots,n_{j},$$

where $\sigma=10^{-3}$ is a sufficiently small speed limit, $n_{j}$ is the number of colliding joints in the algorithm, and the superscripts $x,z$ denote the $x$ and $z$ components of the three-dimensional vector. For specific details, refer to the PIP algorithm. Compared with PIP, the invention removes the limit on the vertical velocity, thereby enabling free motion in three-dimensional space. The solution method of this optimization and the complete human pose and motion estimation follow known techniques. After the human pose and motion are obtained, the camera pose is computed through a human forward kinematics algorithm and used in subsequent steps.
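For illustration only, the constrained acceleration solve above can be posed as a small quadratic program. Below is a minimal sketch using cvxpy; the interface is hypothetical, the $\dot{J}\dot{q}$ term is omitted for brevity, and `contact_rows` is assumed to index the x/z velocity components of colliding joints:

```python
import cvxpy as cp
import numpy as np

def solve_accelerations(J, qdot, theta_ddot_des, a_des, contact_rows,
                        dt=1.0 / 60.0, sigma=1e-3):
    """Sketch of the dynamics optimization: track the dual-PD acceleration
    targets while limiting the x/z (horizontal) velocity of colliding joints;
    the vertical component is left unconstrained, as described above."""
    qddot = cp.Variable(J.shape[1])
    cost = (cp.sum_squares(qddot - theta_ddot_des)   # angular-acceleration target
            + cp.sum_squares(J @ qddot - a_des))     # linear-acceleration target
    v_next = J @ (qdot + dt * qddot)                 # next-step joint velocity
    constraints = [cp.abs(v_next[contact_rows]) <= sigma]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return qddot.value
```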
Further, camera tracking is used: the camera pose and motion information obtained in step 2 are taken as the initial camera pose, and re-projection optimization of the camera pose is performed using the color image captured by the camera and the sparse map points reconstructed in step 4, obtaining the optimized camera pose and confidence, which are used to update the human body position in subsequent steps; suitable key frames are selected according to the motion tracking state of the current frame.
In one embodiment of the present invention, this step is designed based on ORB-SLAM3. ORB features are first extracted from the color image and matched against the sparse map points reconstructed in step 4 using feature similarity, yielding matched 2D-3D point pairs. Denote the world coordinates of a map point by $P_{i}$ and the pixel coordinates of its matched 2D image feature point by $p_{i}$, where $i\in\mathcal{X}$ ranges over all matches. Denote by $\tilde{R},\tilde{t}$ the initial camera pose before optimization; the camera pose $R,t$ is then optimized under the motion capture constraint:

$$\min_{R,t}\ \sum_{i\in\mathcal{X}}\rho\!\left(\left\|p_{i}-\pi\!\left(RP_{i}+t\right)\right\|_{\Sigma_{i}}^{2}\right)+\lambda_{R}\left\|\log\!\left(\tilde{R}^{T}R\right)\right\|^{2}+\lambda_{t}\left\|t-\tilde{t}\right\|^{2}$$

wherein $\rho(\cdot)$ is the robust Huber kernel; $\Sigma_{i}$ is a covariance matrix dependent on the feature point scale, i.e. on the image pyramid level at which the feature is detected; $\log:SO(3)\to\mathbb{R}^{3}$ maps a three-dimensional rotation in the Lie group to the three-dimensional vector space; $\pi(\cdot)$ is the projection operation of the pinhole camera model; and $\lambda_{R}$ and $\lambda_{t}$ are the control coefficients of the motion capture rotation and translation terms. The optimization refines the camera pose by minimizing the re-projection error from the matched 3D map points to the 2D image feature points while keeping the camera pose consistent with the pose constraint provided by human motion capture. The invention runs this optimization 3 times; after each pass, every match is classified as correct or erroneous according to its re-projection error, only correct matches are used in the next pass, and erroneous matches are deleted. Through the strong prior provided by the motion capture constraint, the algorithm distinguishes correct from incorrect matches better, improving camera tracking accuracy. The coefficients of the motion capture constraint terms are $\lambda_{R}=0.01^{2}$ and $\lambda_{t}=0.5^{2}f^{2}s^{2}$, where $f$ is the camera focal length and $s$ is the SLAM coordinate-system scale factor obtained in the calibration step. Scaling by the camera focal length and scale factor unifies the motion capture constraint error to pixel units consistent with the re-projection error, so the optimization is unaffected by the camera focal length or scale factor. The module solves this optimization with the Levenberg-Marquardt algorithm and accelerates the computation with the g2o graph optimization library to guarantee real-time performance. After the camera pose is solved, the module extracts the number $n$ of correctly matched map points and uses it as the confidence of the pose.
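For illustration, this constrained pose refinement can be prototyped with SciPy's least_squares. This is a minimal sketch under simplifying assumptions: axis-angle parameterization, the Huber loss applied to all residuals (the embodiment applies it to the re-projection term only), the scale-dependent covariance $\Sigma_i$ omitted, and the coefficient arguments taken as the square roots of $\lambda_R$ and $\lambda_t$ (e.g. lam_t = 0.5 f s, with f = 500 and s = 1 as placeholder values):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_camera_pose(P_w, p_px, K, R0, t0, lam_R=0.01, lam_t=250.0):
    """Minimize Huber reprojection error of 3D map points P_w (N,3) against
    2D pixels p_px (N,2) while keeping (R, t) near the mocap pose (R0, t0)."""
    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        t = x[3:]
        Pc = P_w @ R.T + t                     # map points in the camera frame
        uv = Pc @ K.T
        uv = uv[:, :2] / uv[:, 2:3]            # pinhole projection pi(.)
        r_rep = (uv - p_px).ravel()            # reprojection residuals (pixels)
        r_rot = lam_R * Rotation.from_matrix(R0.T @ R).as_rotvec()  # log(R0^T R)
        r_trn = lam_t * (t - t0)               # translation prior, pixel units
        return np.concatenate([r_rep, r_rot, r_trn])

    x0 = np.concatenate([Rotation.from_matrix(R0).as_rotvec(), t0])
    sol = least_squares(residuals, x0, loss='huber', f_scale=1.0)
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
```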
It will be appreciated that, in practice, the absolute camera position obtained by pure inertial motion capture typically suffers from substantial drift due to the accumulation of inertial errors. Using it directly to initialize camera tracking optimization is therefore not feasible for longer sequences, since such nonlinear optimization usually requires a good initialization. The invention therefore performs a camera pose alignment step for each frame to eliminate the drift before optimization.
Specifically, the invention computes the relative rotation and displacement of the camera from motion capture and applies them on top of the camera pose previously optimized by SLAM. Denoting the orientation and position of the camera by $R,t$ and the camera pose extracted from motion capture by $\hat{R},\hat{t}$, the camera pose alignment operation is:

$$R_{cur}=R_{last}\,\hat{R}_{last}^{T}\,\hat{R}_{cur},\qquad t_{cur}=t_{last}+\left(\hat{t}_{cur}-\hat{t}_{last}\right),$$

where the subscript $cur$ denotes the current frame and the subscript $last$ denotes the previous frame.
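A minimal sketch of this alignment (NumPy rotation matrices; the composition order follows the equation above, which is itself a reconstruction):

```python
def align_camera_pose(R_last, t_last, Rm_last, tm_last, Rm_cur, tm_cur):
    """Apply the mocap relative motion on top of the last SLAM-optimized pose."""
    R_cur = R_last @ Rm_last.T @ Rm_cur        # relative rotation from mocap
    t_cur = t_last + (tm_cur - tm_last)        # relative translation from mocap
    return R_cur, t_cur
```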
At the end of this step, the invention selects key frames for the subsequent map reconstruction and closed-loop detection steps. The key frame selection scheme follows known practice, but the invention additionally stores the initial camera pose obtained by inertial motion capture within each key frame, for use by the subsequent bundle adjustment and closed-loop detection algorithms.
Further, map reconstruction and closed-loop detection are used: sparse map point reconstruction and closed-loop detection optimization are performed through the key frames selected in step 3. In this step, a bundle adjustment (BA) algorithm runs to optimize the sparse map point positions and camera poses simultaneously; the human kinematic prior and the map point confidence introduced below are incorporated into the algorithm, dynamically constraining the optimization with the human pose and motion estimated in step 2. When the human motion forms a closed loop, inertial-motion-capture-assisted pose graph optimization is performed. Finally, the optimized sparse map point positions and key frame poses are obtained, and the BA and closed-loop detection algorithms run for the next frame. The sparse map point positions and key frame poses are maintained and updated by the system.
Specifically, this embodiment first assigns a confidence value to every map point that participates in the BA optimization. The confidence $c_{i}$ of the $i$-th map point is computed as:

$$c_{i}=K\,b_{i}\,\theta_{i},$$

where $b_{i}$ is the largest distance between the keyframes observing the map point, analogous to the baseline length in three-dimensional reconstruction; $\theta_{i}$ is the maximum angle between the viewing directions under which the map point is observed; and $K$ is a hyperparameter, chosen as 50, such that the average confidence approaches 1. The intuition is that when the keyframes observing a map point span a sufficiently large range and the viewing angle is sufficiently large, the position of the map point is considered more accurate. This confidence is used in the subsequent BA optimization step.
Then, the motion-capture-constrained bundle adjustment optimization is performed. As shown in FIG. 3, the map point positions and the camera poses of the last 20 keyframes are optimized simultaneously, while the other keyframes that observe these map points are kept fixed in the optimization. Denote the set of all optimizable keyframes by $K_{o}$, the set of all fixed keyframes by $K_{f}$, and the set of map points observed by keyframe $j$ by $X_{j}$. Let $R_{o}=\{R_{j}\mid j\in K_{o}\}$ and $T_{o}=\{t_{j}\mid j\in K_{o}\}$ be the keyframe orientations and three-dimensional positions to be optimized, and let $P_{i}$ denote the position of map point $i$ in the world. The motion-capture-constrained bundle adjustment is defined as:

$$\min_{R_{o},T_{o},\{P_{i}\}}\ \sum_{j\in K_{o}\cup K_{f}}\sum_{i\in X_{j}}c_{i}\,\rho\!\left(\left\|p_{ij}-\pi\!\left(R_{j}P_{i}+t_{j}\right)\right\|_{\Sigma_{ij}}^{2}\right)+\sum_{j\in K_{o}}\left(\mu_{R}\left\|\log\!\left(\hat{R}_{j}^{T}R_{j}\right)\right\|^{2}+\mu_{t}\left\|\left(t_{j}-t_{prev(j)}\right)-\left(\hat{t}_{j}-\hat{t}_{prev(j)}\right)\right\|^{2}\right)$$

where $prev(j)$ denotes the keyframe preceding keyframe $j$, and $\mu_{R}=0.01^{2}$ and $\mu_{t}=0.05^{2}s^{2}$ are the coefficients of the motion capture constraint, with units aligned to the pixel scale consistent with the re-projection error. The optimization requires the re-projection errors of the map points to be small while the rotation and the relative position of each keyframe remain similar to the motion capture result. By exploiting the human motion prior provided by motion capture, more accurate simultaneous optimization of map points and camera poses is achieved; the map point confidence $c_{i}$ dynamically determines the relative weight between the motion capture constraint terms and the map point re-projection terms. The solution algorithm is consistent with step 3, with the map point positions marginalized to accelerate the optimization.
When a trajectory closed loop is detected (i.e., the human body returns to a previously visited position), closed-loop optimization is required. The invention designs this closed-loop optimization by modifying the pose graph optimization algorithm so that it integrates the camera pose prior provided by motion capture. The vertex set of the pose graph is F and the edge set is C.
In one embodiment of the present invention, the motion-capture-constrained pose graph optimization proposed by the invention is defined as:

$$\min_{\{T_{j}\mid j\in F\}}\ \sum_{(i,j)\in C}\left\|\log\!\left(T_{ij}^{-1}T_{i}^{-1}T_{j}\right)\right\|^{2}+\omega_{pose}\sum_{j\in F}\left\|\log\!\left(\hat{T}_{j}^{-1}T_{j}\right)\right\|^{2}$$

wherein $T_{j}\in SE(3)$ is the pose of keyframe $j$; $T_{ij}\in SE(3)$ is the relative pose between keyframes $i$ and $j$ before the pose graph optimization; $\hat{T}_{j}$ is the initial camera pose value obtained from motion capture; $\log:SE(3)\to\mathbb{R}^{6}$ maps a pose to the six-dimensional vector space; and $\omega_{pose}$, the relative coefficient of the motion capture constraint, is taken as 0.2.
Further, the human motion update is used to locate the human body position and correct the global motion using the root-node acceleration and human pose obtained in step 2 and the optimized camera pose and confidence obtained in step 3, obtaining an accurate human body position. The camera tracking module takes both visual and inertial information into account, so the camera position it outputs at 30 frames per second can fine-tune the 60-frames-per-second human motion from the inertial motion capture module. The invention implements this module with the prediction-correction algorithm of a Kalman filter.
Specifically, denoting the global position and velocity of the human body by $p$ and $v$, and the global acceleration extracted from the motion capture module by $a$, the following state transition equations can be defined:
$$p_{k}=p_{k-1}+v_{k-1}\,\Delta t$$
$$v_{k}=v_{k-1}+a_{k-1}\,\Delta t+q_{k-1},$$
wherein the subscript $k$ denotes the $k$-th frame, $\Delta t=1/60$ is the time interval between frames, and $q\sim N(0,\sigma^{2}I)$ models the motion capture prediction error; the invention sets $\sigma=\Delta t$, which amounts to assuming that the variance of the acceleration estimated by the network is 1. Denoting the optimized camera position from step 3 by $p_{cam}$, with confidence $n$, the 30 Hz state observation equation is defined as:

$$p_{cam}=p_{k}+\Delta p_{k}+r,$$

where $\Delta p_{k}$ is the three-dimensional position difference between the camera and the root node, computed by the human forward kinematics algorithm in motion capture, and $r\sim N(0,\Sigma_{cam})$ models the error of camera tracking, with the covariance matrix $\Sigma_{cam}$ approximated as the diagonal matrix:

$$\Sigma_{cam}=\frac{1}{n+\epsilon}\,I,$$

where $\epsilon=10^{-3}$ avoids division by zero and $n$ is the camera pose confidence computed in step 3. Based on the state transition equation and the state observation equation, the invention predicts the global position of the human body with a Kalman filtering algorithm as the output.
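For illustration, a minimal NumPy sketch of the predict-correct loop described above, with state $x=[p,v]\in\mathbb{R}^{6}$; the interface is hypothetical:

```python
import numpy as np

def kf_predict(p, v, P, a, dt=1.0 / 60.0):
    """60 Hz prediction with mocap acceleration a; q ~ N(0, sigma^2 I), sigma = dt."""
    F = np.block([[np.eye(3), dt * np.eye(3)],
                  [np.zeros((3, 3)), np.eye(3)]])
    x = F @ np.concatenate([p, v]) + np.concatenate([np.zeros(3), a * dt])
    Q = np.zeros((6, 6))
    Q[3:, 3:] = (dt ** 2) * np.eye(3)            # process noise on the velocity
    return x[:3], x[3:], F @ P @ F.T + Q

def kf_correct(p, v, P, p_cam, dp, n, eps=1e-3):
    """30 Hz correction with the SLAM camera position p_cam; dp is the
    camera-to-root offset from forward kinematics, n the pose confidence."""
    H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe the position only
    Rm = np.eye(3) / (n + eps)                    # Sigma_cam, diagonal
    x = np.concatenate([p, v])
    y = (p_cam - dp) - H @ x                      # innovation
    S = H @ P @ H.T + Rm
    Kg = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    x = x + Kg @ y
    P = (np.eye(6) - Kg @ H) @ P
    return x[:3], x[3:], P
```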
Further, the sparse map points reconstructed in step 4 and the human pose and position obtained in step 5 are rendered in real time, and the final result is visualized. FIG. 4 is an example of the invention running in a virtual scene from a public inertial motion capture dataset, with the reconstructed sparse map points rendered. FIG. 5 shows a practical operation example: the left side shows the motion of the actual human body, and the right side shows the human pose, the human motion trajectory and the reconstructed sparse map points estimated by the system.
In summary, the invention builds on the fact that human motion is usually captured by inertial sensors while the environment is mainly reconstructed using cameras; it integrates the two technologies and develops a method that performs human motion capture, positioning and environment mapping simultaneously, obtaining human motion in real time while performing positioning and sparse map point reconstruction.
In summary, the present invention provides, for the first time, a technique for simultaneous real-time human motion capture, positioning and environment mapping using only 6 inertial sensors (IMUs) and 1 monocular color camera. Inertial motion capture (mocap) exploits "internal" information such as human motion signals and motion priors, while simultaneous localization and mapping (SLAM) relies mainly on "external" information, i.e., the environment captured by the camera. The former is stable, but its global position drifts over long motions because there is no external reference for correction; the latter can estimate the global position in the scene with high accuracy, but easily loses tracking when the environmental information is unreliable (e.g., texture-less regions or occlusion). The invention therefore effectively combines these two complementary techniques (mocap and SLAM), fusing the human motion prior and visual tracking across several key algorithms to achieve stable and accurate human positioning and map reconstruction. The 6 IMUs are worn on the person's limbs, head and back, and the monocular color camera is fixed on the head, shooting outward. This design is inspired by real human behavior: when humans are in a new environment, they acquire scene information with their eyes and plan their actions in the scene accordingly. The monocular camera acts as the human's eye, providing visual signals for real-time scene reconstruction and self-positioning, while the IMUs measure the motion of the body's most important parts, i.e., the limbs and head, in the scene. The overall system realizes, for the first time, simultaneous human motion capture and sparse environment reconstruction based on only 6 IMUs and 1 camera; it runs at 60 fps on a CPU and exceeds the state of the art in both fields in accuracy.
According to the human body motion capturing, positioning and environment mapping method based on the sparse sensor of the embodiments of the present invention, human motion is obtained in real time, positioning and sparse map point reconstruction are performed, real-time simultaneous human body and environment perception based on sparse wearable sensors is realized for the first time, and positioning accuracy is greatly improved compared with the state of the art in both fields.
To implement the above embodiments, as shown in FIG. 6, this embodiment further provides a human body motion capturing, positioning and environment mapping system 10 based on a sparse sensor, where the system 10 includes a data acquisition module 100, a data solving module 200, a re-projection optimization module 300 and a correction, positioning and rendering module 400.
The data acquisition module 100 is used for acquiring inertial measurement values of the IMU sensors and images captured by the camera;
the data solving module 200 is used for solving for human body state data and camera state data based on the inertial measurement values;
the re-projection optimization module 300 is used for performing re-projection optimization on the camera state data using the reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and the correction, positioning and rendering module 400 is used for performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
Further, the system further includes, after the data acquisition module 100:
a data synchronization module for performing coordinate system calibration of the inertial measurement values and time synchronization of the camera images with the inertial measurement values, based on the inertial measurement values of the IMU sensors and the images captured by the camera, to obtain a coordinate system calibration result and a time synchronization result.
Further, the system further includes a key frame selection module after the re-projection optimization module 300,
the key frame selection module being used for selecting key frames according to the motion tracking state of the current frame, based on a preset key frame selection scheme.
Further, the re-projection optimization module 300 is further configured to use the camera state data as an initial camera pose, including:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
Further, the system further includes, after the key frame selection module, a map reconstruction and closed-loop detection module for:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
According to the human body motion capturing, positioning and environment mapping system based on the sparse sensor of the embodiments of the present invention, human motion is obtained in real time, positioning and sparse map point reconstruction are performed, real-time simultaneous human body and environment perception based on sparse wearable sensors is realized for the first time, and positioning accuracy is greatly improved compared with the state of the art in both fields.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

Claims (10)

1. A human body motion capturing, positioning and environment mapping method based on a sparse sensor, characterized by comprising the following steps:
acquiring inertial measurement values of IMU sensors and an image captured by a camera;
solving for human body state data and camera state data based on the inertial measurement values;
performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
2. The method of claim 1, wherein after the acquiring of the inertial measurement values of the IMU sensors and the image captured by the camera, the method further comprises:
performing coordinate system calibration of the inertial measurement values and time synchronization of the camera image with the inertial measurement values, based on the inertial measurement values of the IMU sensors and the image captured by the camera, to obtain a coordinate system calibration result and a time synchronization result.
3. The method of claim 1, wherein after the optimized camera pose and confidence are obtained, the method further comprises:
selecting key frames according to the motion tracking state of the current frame, based on a preset key frame selection scheme.
4. The method of claim 1, wherein using the camera state data as an initial camera pose comprises:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
5. The method of claim 3, further comprising:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
6. A human body motion capturing, positioning and environment mapping system based on a sparse sensor, characterized by comprising:
a data acquisition module for acquiring inertial measurement values of IMU sensors and an image captured by a camera;
a data solving module for solving for human body state data and camera state data based on the inertial measurement values;
a re-projection optimization module for performing re-projection optimization on the camera state data using reconstructed sparse map points and the image captured by the camera, to obtain an optimized camera pose and confidence;
and a correction, positioning and rendering module for performing human body positioning and global motion correction based on the human body state data and the optimized camera pose and confidence to obtain a final human body pose and position, and rendering the reconstructed sparse map points and the final human body pose and position in real time to obtain a visualized rendering result.
7. The system of claim 6, further comprising, after the data acquisition module:
a data synchronization module for performing coordinate system calibration of the inertial measurement values and time synchronization of the camera image with the inertial measurement values, based on the inertial measurement values of the IMU sensors and the image captured by the camera, to obtain a coordinate system calibration result and a time synchronization result.
8. The system of claim 6, further comprising a key frame selection module after the re-projection optimization module,
the key frame selection module being used for selecting key frames according to the motion tracking state of the current frame, based on a preset key frame selection scheme.
9. The system of claim 6, wherein the re-projection optimization module is further configured to use the camera state data as an initial camera pose, including:
extracting ORB features from the image captured by the camera;
and performing feature matching between the ORB features and the reconstructed sparse map points using feature similarity, the camera state data being used as the initial camera pose based on the feature matching result.
10. The system of claim 8, further comprising, after the key frame selection module, a map reconstruction and closed-loop detection module for:
performing motion-capture-constrained bundle adjustment optimization based on the human body state data and the key frames, to simultaneously optimize the positions of the reconstructed sparse map points and the camera poses;
and detecting a human body trajectory closed loop, and performing inertial-motion-capture-assisted pose graph optimization based on the detection result, to obtain optimized sparse map point positions and key frame poses.
CN202310484842.8A 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor Pending CN116503540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310484842.8A CN116503540A (en) 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310484842.8A CN116503540A (en) 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor

Publications (1)

Publication Number Publication Date
CN116503540A 2023-07-28

Family

ID=87326214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310484842.8A Pending CN116503540A (en) 2023-04-28 2023-04-28 Human body motion capturing, positioning and environment mapping method based on sparse sensor

Country Status (1)

Country Link
CN (1) CN116503540A (en)

Similar Documents

Publication Publication Date Title
CN109307508B (en) Panoramic inertial navigation SLAM method based on multiple key frames
Qin et al. Vins-mono: A robust and versatile monocular visual-inertial state estimator
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN110125928B (en) Binocular inertial navigation SLAM system for performing feature matching based on front and rear frames
CN109993113B (en) Pose estimation method based on RGB-D and IMU information fusion
WO2019157925A1 (en) Visual-inertial odometry implementation method and system
US10852847B2 (en) Controller tracking for multiple degrees of freedom
CN111258313A (en) Multi-sensor fusion SLAM system and robot
US20130335529A1 (en) Camera pose estimation apparatus and method for augmented reality imaging
Alcantarilla et al. Visual odometry priors for robust EKF-SLAM
Gemeiner et al. Simultaneous motion and structure estimation by fusion of inertial and vision data
Saini et al. Markerless outdoor human motion capture using multiple autonomous micro aerial vehicles
CN110726406A (en) Improved nonlinear optimization monocular inertial navigation SLAM method
CN111932674A (en) Optimization method of line laser vision inertial system
Shamwell et al. Vision-aided absolute trajectory estimation using an unsupervised deep network with online error correction
CN116222543B (en) Multi-sensor fusion map construction method and system for robot environment perception
CN109242887A (en) A kind of real-time body's upper limks movements method for catching based on multiple-camera and IMU
CN111353355A (en) Motion tracking system and method
CN114485640A (en) Monocular vision inertia synchronous positioning and mapping method and system based on point-line characteristics
WO2024094227A1 (en) Gesture pose estimation method based on kalman filtering and deep learning
CN111489392B (en) Single target human motion posture capturing method and system in multi-person environment
Huai et al. Real-time large scale 3D reconstruction by fusing Kinect and IMU data
TW202314593A (en) Positioning method and equipment, computer-readable storage medium
KR102456872B1 (en) System and method for tracking hand motion using strong coupling fusion of image sensor and inertial sensor
CN112907633A (en) Dynamic characteristic point identification method and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination