CN114359338A - Pose estimation method and device, terminal equipment and computer readable storage medium - Google Patents

Pose estimation method and device, terminal equipment and computer readable storage medium

Info

Publication number
CN114359338A
Authority
CN
China
Prior art keywords
target video
feature
point
characteristic
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111536038.7A
Other languages
Chinese (zh)
Inventor
温焕宇
庞建新
熊友军
谭欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ubtech Technology Co ltd
Original Assignee
Shenzhen Ubtech Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ubtech Technology Co ltd filed Critical Shenzhen Ubtech Technology Co ltd
Priority to CN202111536038.7A priority Critical patent/CN114359338A/en
Publication of CN114359338A publication Critical patent/CN114359338A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)
  • Manipulator (AREA)

Abstract

The application is applicable to the technical field of robots, and provides a pose estimation method, a pose estimation device, terminal equipment and a computer readable storage medium, wherein the pose estimation method comprises the following steps: detecting feature points and feature line segments in a first target video, wherein the target video is acquired through a shooting device installed on the robot; tracking the feature points in the first target video to obtain first point feature information; matching the feature line segments in the first target video to obtain first line feature information; acquiring first inertial navigation data in a time period corresponding to the first target video, wherein the inertial navigation data is acquired through an inertial sensor arranged on the robot; and estimating the pose of the robot according to the first inertial navigation data, the first point feature information and the first line feature information. By the method, the pose estimation precision of the robot can be effectively improved.

Description

Pose estimation method and device, terminal equipment and computer readable storage medium
Technical Field
The application belongs to the technical field of robots, and particularly relates to a pose estimation method and device, a terminal device and a computer readable storage medium.
Background
Simultaneous localization and mapping (SLAM) technology is widely applied to robots. When a robot is placed at an unknown position in an unknown environment, a map of that environment can be built gradually by SLAM as the robot moves.
In the SLAM process, the pose of the robot needs to be estimated in real time as the robot moves. In the prior art, a camera mounted on the robot captures consecutive image frames, image features are detected from these frames, and the pose of the robot is estimated from the features. However, the captured images are easily affected by ambient light, and especially when the robot moves at high speed, frame loss, the rolling shutter ("jelly") effect and the like easily occur, which degrades the accuracy of pose estimation.
Disclosure of Invention
The embodiment of the application provides a pose estimation method and device, terminal equipment and a computer readable storage medium, and can effectively improve the pose estimation precision of a robot.
In a first aspect, an embodiment of the present application provides a pose estimation method, including:
detecting feature points and feature line segments in a first target video, wherein the target video is acquired through a shooting device installed on the robot;
tracking the feature points in the first target video to obtain first point feature information;
matching the characteristic line segments in the first target video to obtain first line characteristic information;
acquiring first inertial navigation data in a time period corresponding to the first target video, wherein the inertial navigation data is acquired through an inertial sensor arranged on the robot;
and estimating the pose of the robot according to the first inertial navigation data, the first point characteristic information and the first line characteristic information.
In the embodiment of the application, visual characteristic information, namely point characteristic information and line characteristic information, is obtained by carrying out image processing on a target video; and then estimating the pose of the robot according to the inertial navigation data and the visual characteristic information, which is equivalent to simultaneously considering the pose obtained by the inertial navigation data and the pose obtained by the visual detection. Because the inertial sensor has the high-frequency anti-interference characteristic, the pose is jointly estimated according to the inertial navigation data and the visual characteristic information, so that the pose estimation error caused by visual influence is favorably avoided; especially, when the robot moves at a high speed, the loss of the characteristic data can be effectively prevented, and the accuracy of the characteristic data is improved. In addition, the visual characteristic information comprises point characteristic information and line characteristic information, and the pose is estimated through the combination of different types of image characteristics, so that the pose estimation precision is further improved.
In a possible implementation manner of the first aspect, the detecting feature points and feature line segments in the first target video includes:
for each frame of image in the first target video, predicting the position of a feature point and the position of a feature line segment in the image according to the first inertial navigation data;
determining a detection range in the image according to the predicted positions of the feature points and the feature line segments in the image;
and detecting the characteristic points and the characteristic line segments in the image in the detection range.
In a possible implementation manner of the first aspect, the matching the feature line segment in the first target video to obtain first line feature information includes:
matching feature line segments in any two adjacent frames of images of the first target video to obtain a plurality of line segment groups, wherein each line segment group comprises two matched feature line segments;
and deleting the line segment groups meeting preset conditions in the plurality of line segment groups, wherein the line segment groups meeting the preset conditions are that the position difference of two characteristic line segments in the line segment groups on the image is greater than a first preset value or the angle difference is greater than a second preset value, and the first line characteristic information comprises the rest line segment groups.
In a possible implementation manner of the first aspect, after estimating the pose of the robot according to the first inertial navigation data, the first point feature information, and the first line feature information, the method further includes:
when a second target video is acquired, acquiring second inertial navigation data in a time period corresponding to the second target video, wherein the second target video is a video acquired within a first preset time after the first target video is acquired;
performing pre-integration processing on the second inertial navigation data to obtain pre-integration data;
acquiring second point characteristic information and second line characteristic information of the second target video;
and performing local optimization processing on the pose of the robot according to the pre-integral data, the second point characteristic information and the second line characteristic information.
In a possible implementation manner of the first aspect, after acquiring the first inertial navigation data within the time period corresponding to the first target video, the method further includes:
acquiring first point cloud data in a time period corresponding to the first target video, wherein the point cloud data is acquired through a radar installed on the robot, and the first point cloud data comprises a plurality of scanning points and a depth value of each scanning point;
performing association processing on the scanning points in the first point cloud data and the feature points in the first target video to obtain depth values of the feature points in the first target video;
and estimating the pose of the robot according to the depth values of the feature points in the first target video, the first inertial navigation data, the first point feature information and the first line feature information.
In a possible implementation manner of the first aspect, the associating the scanning point in the first point cloud data with the feature point in the first target video to obtain a depth value of the feature point in the first target video includes:
respectively mapping the scanning points in the first point cloud data and the characteristic points in the first target video to a preset plane;
acquiring N second mapping points within a preset range of a first mapping point in the preset plane, wherein the first mapping point is a mapping point of a characteristic point in the first target video in the preset plane, the second mapping point is a mapping point of a scanning point in the first point cloud data in the preset plane, and N is a positive integer;
and if the position difference between the N second mapping points is smaller than a third preset value, determining the average depth value of the N second mapping points as the depth value of the first mapping point.
In one possible implementation manner of the first aspect, after estimating the pose of the robot according to the depth values of the feature points in the first target video, the first inertial navigation data, the first point feature information, and the first line feature information, the method further includes:
when a third target video is obtained, performing visual loop detection on the third target video to obtain a detection result, wherein the third target video is obtained within a second preset time after the first target video is obtained;
and if the detection result shows that loop occurs, performing global optimization processing on the pose of the robot according to the first target video and the third target video.
In a second aspect, an embodiment of the present application provides a pose estimation apparatus, including:
the characteristic detection unit is used for detecting characteristic points and characteristic line segments in a first target video, wherein the target video is acquired through a shooting device installed on the robot;
the characteristic tracking unit is used for tracking the characteristic points in the first target video to obtain first point characteristic information;
the characteristic matching unit is used for matching the characteristic line segments in the first target video to obtain first line characteristic information;
the data acquisition unit is used for acquiring first inertial navigation data in a time period corresponding to the first target video, wherein the inertial navigation data is acquired through an inertial sensor arranged on the robot;
and the first estimation unit is used for estimating the pose of the robot according to the first inertial navigation data, the first point characteristic information and the first line characteristic information.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the pose estimation method according to any one of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the pose estimation method according to any one of the above first aspects.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the pose estimation method according to any one of the first aspect.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a pose estimation method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a feature line segment provided in an embodiment of the present application;
FIG. 3 is a framework diagram of a binocular visual-inertial odometry using point and line features, provided by an embodiment of the present application;
fig. 4 is a schematic flowchart of a pose estimation method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of global pose optimization provided by an embodiment of the present application;
FIG. 6 is a framework diagram of the fusion of a binocular visual-inertial odometry using point and line features with a radar, provided by an embodiment of the present application;
fig. 7 is a block diagram of a pose estimation apparatus provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
The pose estimation method provided by the embodiment of the application can be applied to an application scene of simultaneous localization and mapping (SLAM). In this application scenario, the robot draws a map of the current environment while moving. The robot is provided with a shooting device, an inertial sensor and a radar. The pose of the robot can be estimated in real time by the pose estimation method, and the positions of the feature points/lines on each frame of image in the detected video in the three-dimensional space are spliced together according to the pose of the robot so as to construct the three-dimensional map of the current environment.
Referring to fig. 1, which is a schematic flow chart of a pose estimation method provided in an embodiment of the present application, by way of example and not limitation, the method may include the following steps:
s101, detecting feature points and feature line segments in the first target video.
Wherein the target video is acquired by a photographing device installed on the robot.
Specifically, feature points and feature line segments are detected for each frame of image in the first target video. A feature point may be a key point on an object in the current environment, such as a corner point of a table or the pivot point of a clock hand. A feature line segment may be a line segment characterizing the shape of an object, such as an edge line of a table or the skirting line of a wall.
In the embodiment of the present application, feature points may be detected using the Features from Accelerated Segment Test (FAST) algorithm, and feature line segments may be detected using a Line Segment Detector (LSD).
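By way of illustration only, the following is a minimal sketch of per-frame point and line detection. It assumes OpenCV as the image-processing library, which the patent does not specify, and the parameter values are placeholders.

```python
# Sketch of per-frame feature detection, assuming OpenCV (cv2) as the backend;
# the patent itself does not prescribe a library or parameter values.
import cv2

def detect_features(gray_frame):
    # FAST corner detector for feature points
    fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
    keypoints = fast.detect(gray_frame, None)

    # LSD detector for feature line segments (absent in some OpenCV builds;
    # cv2.ximgproc.createFastLineDetector is a common substitute)
    lsd = cv2.createLineSegmentDetector()
    lines, _, _, _ = lsd.detect(gray_frame)

    return keypoints, lines
```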
Optionally, the pose estimation method in the embodiment of the present application may be executed once every time a target video of a preset duration is acquired. The preset duration can be set in advance according to actual needs. When the preset duration is long, the real-time performance of pose estimation is poorer but the estimation accuracy is higher; conversely, when the preset duration is short, the real-time performance is better but the estimation accuracy is lower.
In one embodiment, one implementation of S101 includes:
for each frame of image in the first target video, predicting the position of a feature point and the position of a feature line segment in the image according to the first inertial navigation data; determining a detection range in the image according to the predicted position of the feature point and the predicted position of the feature line segment; and detecting the feature points and the feature line segments in the image within the detection range.
Due to the fact that the sampling frequency of the inertial navigation data is high, the inertial navigation data can be used for predicting the feature position in the next frame of image, and then the feature point and the feature line segment are further detected within a certain range of the predicted feature position. By the method, the range of feature detection can be narrowed, the data processing amount in the process of feature detection is reduced, and the feature detection efficiency is improved.
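A sketch of how the predicted positions could narrow the detection range is given below, again assuming OpenCV; the predicted_pts input stands in for the feature positions propagated from the first inertial navigation data and is hypothetical.

```python
import numpy as np
import cv2

def detect_in_predicted_range(gray_frame, predicted_pts, radius=30):
    # Build a mask that is non-zero only near the IMU-predicted positions,
    # so detection runs over a reduced search range.
    mask = np.zeros(gray_frame.shape[:2], dtype=np.uint8)
    for (u, v) in predicted_pts:
        cv2.circle(mask, (int(u), int(v)), radius, 255, thickness=-1)

    fast = cv2.FastFeatureDetector_create(threshold=20)
    return fast.detect(gray_frame, mask)
```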
S102, tracking the characteristic points in the first target video to obtain first point characteristic information.
Tracking the feature points means taking the feature points as targets; an existing target tracking algorithm can be used to track them.
In the embodiment of the present application, an optical flow tracking method may be adopted to track the feature points in the first target video. The optical flow is the instantaneous velocity of pixel motion of a spatially moving object on the viewing imaging plane. The optical flow tracking method is a method for calculating motion information of an object between adjacent frames by using the change of pixels in an image sequence in a time domain and the correlation between adjacent frames to find the corresponding relationship between a previous frame and a current frame. The motion track of the robot can be calculated by the motion information of the object calculated by the optical flow tracking method. The first point characteristic information in the embodiment of the present application may include information used for calculating a motion trajectory of the robot, such as a speed, a position, a moving direction, and the like.
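A minimal sketch of the optical flow tracking step, assuming OpenCV's pyramidal Lucas-Kanade implementation; the window size and pyramid level are illustrative values.

```python
import numpy as np
import cv2

def track_points(prev_gray, curr_gray, prev_pts):
    # prev_pts: Nx1x2 float32 array of feature point coordinates
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    good = status.reshape(-1) == 1
    # Matched pairs give per-point displacement between adjacent frames,
    # which feeds the point feature information used for pose estimation.
    return prev_pts[good], curr_pts[good]
```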
S103, matching the characteristic line segments in the first target video to obtain first line characteristic information.
In the embodiment of the application, the feature line segments can be matched by a descriptor matching method. Each feature line segment is described by a descriptor, and feature line segments with the same descriptor in two adjacent frames of images are matched together. The descriptor may represent the position feature, shape feature and the like of a feature line segment, for example that the feature line segment is an edge at a certain position of a table, or a horizontal ground line extending from a certain corner of a wall.
In the embodiment of the present application, two line feature representations may be adopted: Plücker coordinates and the orthonormal representation. Plücker coordinates are determined by 6 parameters and the orthonormal representation by 4 parameters. Plücker coordinates can be used for the coordinate transformation and reprojection processing at the front end, and the orthonormal representation can be used for the optimization processing at the back end.
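For illustration, the sketch below converts the 6-parameter Plücker coordinates into the 4-degree-of-freedom orthonormal representation, following the formulation commonly used in point-line visual-inertial odometry; the exact parameterization is an assumption, since the patent does not spell it out.

```python
import numpy as np

def plucker_to_orthonormal(n, v):
    """n: normal of the plane through the line and the origin,
    v: direction vector; (n, v) are the 6-parameter Plücker coordinates."""
    n, v = np.asarray(n, float), np.asarray(v, float)
    # U in SO(3): rotation built from the line geometry
    u1 = n / np.linalg.norm(n)
    u2 = v / np.linalg.norm(v)
    u3 = np.cross(u1, u2)
    U = np.column_stack((u1, u2, u3 / np.linalg.norm(u3)))
    # W in SO(2): encodes the distance of the line from the origin
    s = np.hypot(np.linalg.norm(n), np.linalg.norm(v))
    w1, w2 = np.linalg.norm(n) / s, np.linalg.norm(v) / s
    W = np.array([[w1, -w2], [w2, w1]])
    return U, W  # 3 + 1 = 4 degrees of freedom
```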
In the process of matching feature line segments, the matching conditions are often chosen loosely, which increases the probability of a successful match but reduces matching precision. To improve matching accuracy, in one embodiment, one implementation of matching the feature line segments includes:
matching feature line segments in any two adjacent frames of images of the first target video to obtain a plurality of line segment groups, wherein each line segment group comprises two matched feature line segments; and deleting the line segment groups meeting preset conditions in the line segment groups, wherein the line segment groups meet the preset conditions that the position difference of two characteristic line segments in the line segment groups on the image is greater than a first preset value or the angle difference is greater than a second preset value, and the first line characteristic information comprises the rest line segment groups.
Fig. 2 is a schematic diagram of a feature line segment provided in the embodiment of the present application. As shown in fig. 2, δ d in the figure represents a position difference between two characteristic line segments, and δ γ in the figure represents an angle difference between two characteristic line segments.
By the method, the feature line segments which are successfully matched are screened, and the matching relation of the feature line segments with larger angle and position difference is deleted, so that the matching precision is improved.
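A sketch of this screening step is given below; the thresholds stand in for the first and second preset values and are illustrative only.

```python
import numpy as np

def filter_line_matches(groups, max_dist=20.0, max_angle=np.deg2rad(10)):
    """groups: list of (seg_a, seg_b), each segment given as ((x1, y1), (x2, y2)).
    max_dist / max_angle stand in for the first and second preset values."""
    kept = []
    for (a, b) in groups:
        mid_a = (np.asarray(a[0]) + np.asarray(a[1])) / 2.0
        mid_b = (np.asarray(b[0]) + np.asarray(b[1])) / 2.0
        delta_d = np.linalg.norm(mid_a - mid_b)            # position difference
        ang_a = np.arctan2(a[1][1] - a[0][1], a[1][0] - a[0][0])
        ang_b = np.arctan2(b[1][1] - b[0][1], b[1][0] - b[0][0])
        delta_g = abs(np.arctan2(np.sin(ang_a - ang_b),    # angle difference
                                 np.cos(ang_a - ang_b)))
        if delta_d <= max_dist and delta_g <= max_angle:
            kept.append((a, b))                            # keep this line group
    return kept
```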
Through the matching processing of the characteristic line segments, the motion information of the object between the adjacent frames can be determined through the correlation of the characteristic line segments between the adjacent frames, and then the motion trail of the robot is calculated. The first line characteristic information in the embodiment of the present application may include information used for calculating a motion trajectory of the robot, such as a speed, a position, a moving direction, and the like.
S104, acquiring first inertial navigation data in a time period corresponding to the first target video.
The inertial navigation data is acquired by an inertial measurement unit (IMU) mounted on the robot. The IMU contains a gyroscope and an accelerometer and can measure the acceleration and angular velocity of a moving object. Once the acceleration and angular velocity are known, the position, velocity and rotation information (p, v, q) can be obtained by integrating the kinematic equations.
As described above, the sampling frequency of the inertial sensor is generally high, while the sampling frequency of the camera is relatively low, so the timestamps of the inertial navigation data and the visual data may not align. For example, 10 sets of inertial navigation data may be acquired for every frame of image. If the inertial navigation data and the visual data cannot be aligned by timestamp, they cannot be fused subsequently.
In order to solve the problem, in the embodiment of the present application, the inertial navigation data may be subjected to pre-integration processing first. The pre-integration processing process is to integrate the inertial navigation data within a period of time to obtain a set of inertial navigation data. Through pre-integration processing, multiple sets of inertial navigation data can be integrated into 1 set of inertial navigation data, and then the 1 set of inertial navigation data is time-aligned with an image.
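A simplified sketch of pre-integrating one batch of IMU samples into a single relative motion increment follows; the bias and noise terms, covariance propagation and bias Jacobians of the full pre-integration model are omitted for brevity.

```python
import numpy as np

def preintegrate(accels, gyros, dts):
    """accels, gyros: lists of 3-vectors in the body frame; dts: sample periods.
    Returns the position, velocity and rotation increments (dp, dv, dR)
    accumulated between two image timestamps. Bias/noise terms are omitted."""
    dp = np.zeros(3)
    dv = np.zeros(3)
    dR = np.eye(3)
    for a, w, dt in zip(accels, gyros, dts):
        a = np.asarray(a, float)
        dp += dv * dt + 0.5 * (dR @ a) * dt**2
        dv += (dR @ a) * dt
        # First-order rotation update via the skew-symmetric matrix of w*dt
        wx, wy, wz = np.asarray(w, float) * dt
        dR = dR @ (np.eye(3) + np.array([[0, -wz, wy],
                                         [wz, 0, -wx],
                                         [-wy, wx, 0]]))
    return dp, dv, dR
```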
And S105, estimating the pose of the robot according to the first inertial navigation data, the first point characteristic information and the first line characteristic information.
In one implementation, poses of the robot can be estimated from the first inertial navigation data, the first point feature information and the first line feature information respectively, and the estimated sets of poses are then weighted to obtain the final pose.
In order to improve the pose estimation accuracy, the pose may be further optimized. In one embodiment, after S105, the method further includes:
when a second target video is obtained, obtaining second inertial navigation data in a time period corresponding to the second target video, wherein the second target video is a video obtained within a first preset time after the first target video is obtained; performing pre-integration processing on the second inertial navigation data to obtain pre-integration data; acquiring second point characteristic information and second line characteristic information of a second target video; and performing local optimization processing on the pose of the robot according to the pre-integral data, the second point characteristic information and the second line characteristic information.
The first preset time is equivalent to a sliding time window. After the first target video is obtained, a second target video is obtained within this time window and used for optimization. By analogy, the time window slides backwards, and every time the target video in the latest time window is acquired, optimization is performed using the most recently acquired target video.
The process of performing pre-integration processing on the second inertial navigation data may refer to the description in the embodiment of S104, and is not described herein again.
It should be noted that, as described in the embodiment of S104, the sampling frequency of the IMU is high, typically 100 Hz to 1000 Hz, and the data volume is very large. When performing optimization, it is not possible to put that much data into the state variables, so the common practice is to take one state at intervals, for example every 1 second. The basic process is: knowing the PVQ (position, velocity and rotation) at the ith second and all IMU data between the ith and jth seconds (e.g., 100 samples), the PVQ at the jth second is obtained by integrating these data sample by sample. However, during back-end optimization, when iterative updates adjust the PVQ values, once the PVQ at a certain time is adjusted, all intermediate results and everything that follows must be re-integrated; at 100 Hz this means re-integrating 100 samples between two states. The purpose of pre-integration is to turn these 100 integration steps into a single one, that is, to replace 100 values with 1 value, so applying the pre-integration model greatly reduces the amount of computation.
Optionally, the objective function for joint optimization of the inertial navigation data and the visual data is:

\min_{\chi}\left\{ \left\| r_p - J_p \chi \right\|^2 + \sum_{k \in \mathcal{B}} \left\| r_B\left(\hat{z}_{b_k b_{k+1}}, \chi\right) \right\|^2 + \sum_{(n,j) \in \mathcal{P}} \rho\left( \left\| r_P\left(\hat{z}_{p_n}^{c_j}, \chi\right) \right\|^2 \right) + \sum_{(o,j) \in \mathcal{L}} \rho\left( \left\| r_L\left(\hat{z}_{l_o}^{c_j}, \chi\right) \right\|^2 \right) \right\}

where χ denotes the state variables within the sliding window; r_p and J_p represent the prior information residual obtained by sliding-window marginalization, and J_p is the Hessian matrix obtained from the previous optimization; ρ is a robust kernel function used to suppress outliers; r_B(\hat{z}_{b_k b_{k+1}}, χ) is the IMU pre-integration residual between frames k and k+1, and \mathcal{B} is the set of all IMU pre-integration measurements within the sliding window (i.e., the first preset time); r_P(\hat{z}_{p_n}^{c_j}, χ) is the reprojection error of feature point n in the jth frame image, r_L(\hat{z}_{l_o}^{c_j}, χ) is the reprojection error of feature line segment o in the jth frame image, and \mathcal{P} and \mathcal{L} are the sets of feature points and feature line segments observed by the camera. The objective function is a least-squares problem; in the embodiment of the present application, the LM (Levenberg-Marquardt) algorithm may be used to solve it.

Illustratively, referring to fig. 3, a framework diagram of a binocular visual-inertial odometry using point and line features is provided according to an embodiment of the present application. As shown in fig. 3, the acquisition of sensor data and the pre-processing of the acquired data may be performed by the front end, while the optimization may be performed by the back end. In this embodiment, the data processing tasks of the front end and the back end may be executed by two separate processes. The initialization process may adopt loose coupling of visual data and inertial navigation data to obtain the initial state quantities for optimization. The front end performs pre-integration on the IMU data, performs line feature detection and point feature detection on the visual images, and performs line feature matching and point feature tracking. During back-end optimization, the 6-degree-of-freedom pose is optimized according to the inertial navigation data and visual data within the sliding window.
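As a schematic illustration of this joint optimization, the sketch below stacks residual blocks and minimizes them with a robust nonlinear least-squares solver. The toy residual blocks are placeholders for the prior, IMU pre-integration and point/line reprojection terms defined above; SciPy's trust-region solver is used here only because it exposes a Huber loss, so this is an assumption rather than the patent's own LM implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def joint_residual(x, blocks):
    # blocks: callables, each returning one residual vector (the prior term,
    # an IMU pre-integration term, or a point/line reprojection term).
    return np.concatenate([b(x) for b in blocks])

# Toy residual blocks standing in for the real terms; a real system would
# build them from the marginalization prior, the IMU pre-integration and
# the point/line feature observations in the sliding window.
blocks = [lambda x: x[:3] - 1.0, lambda x: 0.5 * (x[3:] - 2.0)]
x0 = np.zeros(6)

# loss='huber' plays the role of the robust kernel rho; the 'trf' method is
# used because SciPy's pure LM mode does not accept a robust loss, but the
# least-squares structure of the problem is the same.
sol = least_squares(joint_residual, x0, args=(blocks,), loss='huber', method='trf')
```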
Triangulation can compute the depth of points and lines in the image in three-dimensional space, thereby determining their positions in three-dimensional space. When building a three-dimensional map of the environment, the positions of the feature points and feature line segments in different frame images in three-dimensional space are stitched together according to the determined 6-degree-of-freedom poses to obtain the three-dimensional map.
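A minimal sketch of the triangulation step, assuming OpenCV; the projection matrices are assumed to come from the calibrated camera poses of the two views.

```python
import cv2

def triangulate(P0, P1, pts0, pts1):
    """P0, P1: 3x4 projection matrices of two views (K @ [R|t]);
    pts0, pts1: 2xN matched pixel coordinates. Returns Nx3 points whose
    depth is used when stitching features into the 3D map."""
    pts4d = cv2.triangulatePoints(P0, P1, pts0, pts1)
    pts3d = (pts4d[:3] / pts4d[3]).T
    return pts3d
```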
In the embodiment of fig. 1, visual feature information, i.e., point feature information and line feature information, is obtained by performing image processing on a target video; and then estimating the pose of the robot according to the inertial navigation data and the visual characteristic information, which is equivalent to simultaneously considering the pose obtained by the inertial navigation data and the pose obtained by the visual detection. Because the inertial sensor has the high-frequency anti-interference characteristic, the pose is jointly estimated according to the inertial navigation data and the visual characteristic information, so that the pose estimation error caused by visual influence is favorably avoided; especially, when the robot moves at a high speed, the loss of the characteristic data can be effectively prevented, and the accuracy of the characteristic data is improved. In addition, the visual characteristic information comprises point characteristic information and line characteristic information, and the pose is estimated through the combination of different types of image characteristics, so that the pose estimation precision is further improved.
In one embodiment, to further improve the pose estimation accuracy, feature data, such as point cloud data acquired by radar, may also be added. Fig. 4 is a schematic flow chart of a pose estimation method according to another embodiment of the present application. As shown in fig. 4, the method further comprises:
s401, detecting a feature point and a feature line segment in a first target video.
S402, tracking the characteristic points in the first target video to obtain first point characteristic information.
S403, performing matching processing on the feature line segments in the first target video to obtain first line feature information.
S404, acquiring first inertial navigation data in a time period corresponding to the first target video.
The steps S401 to S404 are the same as those of S101 to S104, and reference may be made to the description in the embodiments of S101 to S104, which is not described herein again.
S405, first point cloud data in a time period corresponding to the first target video are obtained.
The point cloud data is acquired through a radar installed on the robot, and the first point cloud data comprises a plurality of scanning points and the depth value of each scanning point.
S406, performing association processing on the scanning points in the first point cloud data and the feature points in the first target video to obtain depth values of the feature points in the first target video.
Associating the scanning points in the point cloud data with the feature points in the target video essentially means finding the correspondence between the radar data and the visual data. The transformation between the radar coordinate system and the camera coordinate system can be calibrated in advance, and the mapping relationship between the scanning points and the feature points can be determined according to this transformation.
During association, one scanning point is usually associated with one feature point. However, the number of scanning points is usually large and the scanning points in some areas are dense, so a one-to-one association may produce association errors and hence inaccurate feature depth values, which affects the pose estimation accuracy. To address this issue, in one embodiment, S406 may include:
respectively mapping scanning points in the first point cloud data and characteristic points in the first target video to a preset plane; acquiring N second mapping points within a preset range of the first mapping point in a preset plane; and if the position difference between the N second mapping points is smaller than a third preset value, determining the average depth value of the N second mapping points as the depth value of the first mapping point.
The first mapping point is a mapping point of a feature point in the first target video in a preset plane, the second mapping point is a mapping point of a scanning point in the first point cloud data in the preset plane, and N is a positive integer. The preset plane may be a plane located in a unit spherical coordinate system conforming to the camera coordinate system.
By this method, if the position difference between the N second mapping points is smaller than the third preset value, the N second mapping points are close to each other, and taking their average depth value as the depth value of the first mapping point avoids the inaccurate feature depth values caused by association errors and the resulting loss of pose estimation accuracy.
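A sketch of this association step follows; the radius and spread thresholds stand in for the preset range and the third preset value and are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

def associate_depth(feat_uv, scan_uv, scan_depth, radius=5.0, max_spread=2.0):
    """feat_uv: Mx2 feature-point projections and scan_uv: Kx2 scan-point
    projections on the common (preset) plane; scan_depth: K depth values.
    radius / max_spread stand in for the preset range and third preset value."""
    scan_uv = np.asarray(scan_uv, float)
    scan_depth = np.asarray(scan_depth, float)
    tree = cKDTree(scan_uv)
    depths = np.full(len(feat_uv), np.nan)
    for i, p in enumerate(np.asarray(feat_uv, float)):
        idx = tree.query_ball_point(p, r=radius)   # N second mapping points nearby
        if not idx:
            continue
        nbr = scan_uv[idx]
        # spread of the N second mapping points on the plane
        if np.linalg.norm(nbr.max(axis=0) - nbr.min(axis=0)) < max_spread:
            depths[i] = scan_depth[idx].mean()     # average depth as feature depth
    return depths
```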
S407, estimating the pose of the robot according to the depth value of the feature point in the first target video, the first inertial navigation data, the first point feature information and the first line feature information.
In one implementation, the pose of the robot may be estimated according to the depth values of the feature points in the first target video, the first inertial navigation data, the first point feature information, and the first line feature information, and then the estimated multiple sets of poses are weighted to obtain the final pose.
In one embodiment, after S407, the method further comprises:
when a third target video is obtained, performing visual loop detection on the third target video to obtain a detection result; and if the detection result shows that loop occurs, performing global optimization processing on the pose of the robot according to the first target video and the third target video.
And the third target video is a video acquired within a second preset time after the first target video is acquired.
Visual loop detection is used to judge whether the current frame is similar to a historical frame, i.e., whether the robot has passed through the same place. If a loop is detected, more effective data can be provided for the back-end pose graph, so that a better, and in particular globally consistent, estimate can be obtained; loop detection can also be used for relocalization or pose optimization.
When a loop is detected between an image in the keyframe database (where historical frames are stored) and a marginalized image in the sliding window, the system performs pose graph optimization on all keyframes in the database. Fig. 5 is a schematic diagram of global pose optimization provided in the embodiment of the present application. As shown in fig. 5, the frames in the dashed box are marginalized frames, the dashed line indicates that a loop occurs, i.e., a loop edge, and the poses are revised after global pose graph optimization.
In the embodiment of the application, a DBoW2 bag-of-words model established based on visual BRIEF descriptors can be adopted. The system will supplement the newly detected corner points for the new key frame and perform BRIEF descriptor calculation for all feature points. The system calculates the similarity between the current frame image and the historical frame, and if the similarity meets the threshold score, the occurrence of loop is judged. When the loop occurs, the system performs pose graph optimization, namely global graph optimization according to all the frame images acquired currently.
One way of calculating the similarity may be: as the robot moves, the feature point information in the historical frame images is stored, and the historical feature point information is matched with the feature point information in the current frame; the ratio of the number of matched feature points to the total number of feature points in the current frame is taken as the similarity. If the similarity meets the preset score, the current frame is similar to the historical frame, and a loop has occurred. The current feature points are then revised according to the historical feature points, which is the global graph optimization process.
When the historical feature points and the current frame feature points are matched, the feature points can be described by using a BRIEF descriptor, and the feature points with the same descriptor can be determined as matched feature points.
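A sketch of this similarity computation follows; ORB-style binary descriptors matched with the Hamming norm are used as a stand-in for the BRIEF descriptors and DBoW2 scoring mentioned above.

```python
import cv2

def frame_similarity(desc_curr, desc_hist, num_curr_points, ratio=0.75):
    """desc_curr / desc_hist: binary descriptors (e.g. ORB/BRIEF) of the
    current and a historical frame; returns the matched-point ratio used
    as the similarity score."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(desc_curr, desc_hist, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) / max(num_curr_points, 1)

# A loop is declared when frame_similarity(...) exceeds the preset score.
```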
Unlike the local optimization in the embodiment of fig. 1, which is performed on the latest data within the sliding time window, global graph optimization is performed on all of the currently acquired visual data. In addition, local optimization uses several kinds of data, including visual data, inertial navigation data and point cloud data, whereas global graph optimization uses visual data. Local optimization runs at a relatively high frequency, while the global graph optimization result is relatively more accurate. In practice, the system may perform local optimization periodically as the time window slides, and perform global graph optimization when a loop is detected.
After the step S105, the steps of the visual loop detection and the global optimization may be performed. After S407, the procedure of the local optimization processing described above may be executed.
Since the gravity direction is aligned to the inertial coordinate system during initialization, the roll angle φ and the pitch angle θ of the system are observable. Therefore, when optimizing the global pose graph, only the position (x, y, z) and the yaw angle ψ need to be optimized, i.e., 4 degrees of freedom in total. The residual between the ith frame and the jth frame can be defined as:

r_{i,j}(x_i, x_j) = \begin{bmatrix} R(\hat{\phi}_i, \hat{\theta}_i, \psi_i)^{-1}(p_j - p_i) - \hat{p}^{\,i}_{ij} \\ \psi_j - \psi_i - \hat{\psi}_{ij} \end{bmatrix}

where \hat{\phi}_i and \hat{\theta}_i denote the roll angle and pitch angle estimated by the visual-inertial odometry, and \hat{p}^{\,i}_{ij} and \hat{\psi}_{ij} denote the sequential edge between the ith frame and the jth frame, i.e., the relative transform between two keyframes in the sliding window. Meanwhile, a loop edge represents the 4-degree-of-freedom pose transform between two frames for which a correct loop is detected. Finally, all sequential edges \mathcal{S} and loop edges \mathcal{L} are optimized.

The global graph optimization objective function is:

\min_{p,\psi}\left\{ \sum_{(i,j) \in \mathcal{S}} \left\| r_{i,j} \right\|^2 + \sum_{(i,j) \in \mathcal{L}} \rho\left( \left\| r_{i,j} \right\|^2 \right) \right\}

where ρ represents a robust kernel function.
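A sketch of the 4-degree-of-freedom residual reconstructed above is given below; the Z-Y-X Euler convention used to build the rotation from yaw, pitch and roll is an assumption.

```python
import numpy as np

def rot_zyx(yaw, pitch, roll):
    # Rotation built as Rz(yaw) @ Ry(pitch) @ Rx(roll)
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def pose_graph_residual(p_i, p_j, yaw_i, yaw_j, roll_i, pitch_i,
                        rel_p_hat, rel_yaw_hat):
    """p_i, p_j: keyframe positions; roll_i, pitch_i are fixed to the values
    estimated by the visual-inertial odometry; rel_p_hat, rel_yaw_hat come
    from a sequential edge or a loop edge."""
    R_i = rot_zyx(yaw_i, pitch_i, roll_i)
    r_p = R_i.T @ (np.asarray(p_j) - np.asarray(p_i)) - rel_p_hat  # position part
    r_yaw = yaw_j - yaw_i - rel_yaw_hat
    r_yaw = np.arctan2(np.sin(r_yaw), np.cos(r_yaw))               # wrap to [-pi, pi]
    return np.concatenate([r_p, [r_yaw]])                          # 4-DoF residual
```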
Through the loop detection, accumulated errors can be corrected, loop constraints are added in the subsequent optimization process, and the accuracy of the optimization result can be effectively improved.
Illustratively, referring to fig. 6, a framework diagram of the fusion of a binocular visual-inertial odometry using point and line features with a radar is provided in an embodiment of the present application. As shown in fig. 6, on the basis of the embodiment of fig. 3, radar point cloud data is added, which increases the data types and further improves detection accuracy. By associating the point cloud with the point features, more accurate depth values of the point features can be obtained. In the embodiment of the application, visual loop detection is added for global pose graph optimization to obtain 4-degree-of-freedom poses (the 4-parameter orthonormal representation of the feature lines can be used for the global pose graph optimization).
In the SLAM process, depth values of point features determined by point cloud and point feature association may be prioritized. And when the point cloud data is not acquired, acquiring the depth value of the point feature through triangulation processing.
In the embodiment of fig. 4, visual feature information, that is, point feature information and line feature information, is obtained by performing image processing on a target video; through the association of the point cloud data and the feature points, a more accurate depth value of each feature point can be obtained; estimating the pose of the robot according to the depth values of the feature points, the inertial navigation data and the visual feature information, which is equivalent to simultaneously considering the depth information acquired by a radar, the pose acquired by the inertial navigation data and the pose acquired by visual detection; by the method, the detection data of different types are fused, and the pose of the robot is comprehensively estimated by using the detection data of different types, so that the pose estimation precision is greatly improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 is a block diagram of the pose estimation apparatus provided in the embodiment of the present application, which corresponds to the pose estimation method described in the above embodiment, and only shows the relevant portions in the embodiment of the present application for convenience of explanation.
Referring to fig. 7, the apparatus includes:
a feature detection unit 701, configured to detect a feature point and a feature line segment in a first target video, where the target video is acquired by a shooting device installed on the robot.
A feature tracking unit 702, configured to track feature points in the first target video to obtain first point feature information.
A feature matching unit 703, configured to perform matching processing on the feature line segment in the first target video to obtain first line feature information.
A data obtaining unit 704, configured to obtain first inertial navigation data in a time period corresponding to the first target video, where the inertial navigation data is obtained through an inertial sensor installed on the robot.
A first estimating unit 705, configured to estimate a pose of the robot according to the first inertial navigation data, the first point feature information, and the first line feature information.
Optionally, the feature detection unit 701 is further configured to:
for each frame of image in the first target video, predicting the position of a feature point and the position of a feature line segment in the image according to the first inertial navigation data;
determining a detection range in the image according to the predicted positions of the feature points and the feature line segments in the image;
and detecting the characteristic points and the characteristic line segments in the image in the detection range.
Optionally, the feature matching unit 703 is further configured to:
matching feature line segments in any two adjacent frames of images of the first target video to obtain a plurality of line segment groups, wherein each line segment group comprises two matched feature line segments;
and deleting the line segment groups meeting preset conditions in the plurality of line segment groups, wherein the line segment groups meeting the preset conditions are that the position difference of two characteristic line segments in the line segment groups on the image is greater than a first preset value or the angle difference is greater than a second preset value, and the first line characteristic information comprises the rest line segment groups.
Optionally, the apparatus 7 further comprises:
the local optimization unit 706 is configured to, after estimating the pose of the robot according to the first inertial navigation data, the first point feature information, and the first line feature information, when a second target video is obtained, obtain second inertial navigation data in a time period corresponding to the second target video, where the second target video is a video obtained within a first preset time after the first target video is obtained; performing pre-integration processing on the second inertial navigation data to obtain pre-integration data; acquiring second point characteristic information and second line characteristic information of the second target video; and performing local optimization processing on the pose of the robot according to the pre-integral data, the second point characteristic information and the second line characteristic information.
Optionally, the apparatus 7 further comprises:
a point cloud obtaining unit 707, configured to obtain first point cloud data in a time period corresponding to the first target video after obtaining first inertial data in the time period corresponding to the first target video, where the point cloud data is obtained by a radar installed on the robot, and the first point cloud data includes a plurality of scan points and a depth value of each scan point.
A feature associating unit 708, configured to perform association processing on the scanning point in the first point cloud data and the feature point in the first target video to obtain a depth value of the feature point in the first target video.
A second estimating unit 709, configured to estimate a pose of the robot according to the depth values of the feature points in the first target video, the first inertial navigation data, the first point feature information, and the first line feature information.
Optionally, the feature associating unit 708 is further configured to:
respectively mapping the scanning points in the first point cloud data and the characteristic points in the first target video to a preset plane;
acquiring N second mapping points within a preset range of a first mapping point in the preset plane, wherein the first mapping point is a mapping point of a characteristic point in the first target video in the preset plane, the second mapping point is a mapping point of a scanning point in the first point cloud data in the preset plane, and N is a positive integer;
and if the position difference between the N second mapping points is smaller than a third preset value, determining the average depth value of the N second mapping points as the depth value of the first mapping point.
Optionally, the apparatus 7 further comprises:
a global optimization unit 710, configured to, after estimating a pose of the robot according to depth values of feature points in the first target video, the first inertial navigation data, the first point feature information, and the first line feature information, perform, when a third target video is obtained, visual loop detection on the third target video to obtain a detection result, where the third target video is a video obtained within a second preset time after the first target video is obtained; and if the detection result shows that loop occurs, performing global optimization processing on the pose of the robot according to the first target video and the third target video.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
In addition, the pose estimation apparatus shown in fig. 7 may be a software unit, a hardware unit, or a combination of software and hardware unit that is built in the existing terminal device, may be integrated into the terminal device as an independent pendant, or may exist as an independent terminal device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 8, the terminal device 8 of this embodiment includes: at least one processor 80 (only one shown in fig. 8), a memory 81, and a computer program 82 stored in the memory 81 and executable on the at least one processor 80, the processor 80 implementing the steps in any of the various pose estimation method embodiments described above when executing the computer program 82.
The terminal device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The terminal device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that fig. 8 is merely an example of the terminal device 8, and does not constitute a limitation of the terminal device 8, and may include more or less components than those shown, or combine some components, or different components, such as an input-output device, a network access device, and the like.
The processor 80 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may in some embodiments be an internal storage unit of the terminal device 8, such as a hard disk or a memory of the terminal device 8. In other embodiments, the memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 8. Further, the memory 81 may also include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 81 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to an apparatus/terminal device, recording medium, computer Memory, Read-Only Memory (ROM), Random-Access Memory (RAM), electrical carrier wave signals, telecommunications signals, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A pose estimation method applied to a robot, the method comprising:
detecting feature points and feature line segments in a first target video, wherein the first target video is acquired through a shooting device installed on the robot;
tracking the feature points in the first target video to obtain first point feature information;
matching the feature line segments in the first target video to obtain first line feature information;
acquiring first inertial navigation data in a time period corresponding to the first target video, wherein the first inertial navigation data is acquired through an inertial sensor arranged on the robot;
and estimating the pose of the robot according to the first inertial navigation data, the first point feature information and the first line feature information.
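As a non-limiting illustration of the flow recited in claim 1, the following Python sketch detects point and line features, tracks the points with optical flow, matches the line segments between frames, and hands everything to a fusion step together with the inertial data. OpenCV is assumed for detection and tracking (the FastLineDetector requires an opencv-contrib build), and `match_line_segments` and `fuse_point_line_imu` are hypothetical placeholders, since the claim does not prescribe a particular line matcher or fusion back end.

```python
import cv2

def estimate_pose(frames, imu_samples):
    """Rough control flow of claim 1: point/line features + IMU -> pose.

    frames:      list of grayscale images from the on-board camera
    imu_samples: inertial measurements covering the same time window
    """
    prev = frames[0]
    # Detect feature points (Shi-Tomasi corners) and line segments.
    points = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                     qualityLevel=0.01, minDistance=10)
    lsd = cv2.ximgproc.createFastLineDetector()   # needs opencv-contrib
    lines = lsd.detect(prev)

    point_tracks, line_groups = [], []
    for cur in frames[1:]:
        # Track the feature points with sparse optical flow (KLT).
        next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, points, None)
        good = status.reshape(-1) == 1
        point_tracks.append((points[good], next_pts[good]))

        # Match line segments between adjacent frames; the matcher is a
        # hypothetical placeholder (see the filtering sketch under claim 3).
        cur_lines = lsd.detect(cur)
        line_groups.append(match_line_segments(lines, cur_lines))

        prev, points, lines = cur, next_pts[good], cur_lines

    # Fuse the IMU data with the point and line observations; the fusion
    # back end is likewise a hypothetical placeholder.
    return fuse_point_line_imu(imu_samples, point_tracks, line_groups)
```

The sketch only mirrors the order of operations in the claim; a real system would typically run detection, tracking, and fusion asynchronously and maintain a sliding window of frames.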
2. The pose estimation method according to claim 1, wherein the detecting of the feature points and the feature line segments in the first target video includes:
for each frame of image in the first target video, predicting the position of a feature point and the position of a feature line segment in the image according to the first inertial navigation data;
determining a detection range in the image according to the predicted positions of the feature points and the feature line segments in the image;
and detecting the feature points and the feature line segments in the image within the detection range.
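To make the step in claim 2 concrete, the sketch below computes a rectangular detection range around a feature position predicted from the inertial data. The previous feature position, the IMU-predicted pixel displacement, the margin, and the image size are all assumed inputs; the claim does not fix how the prediction is obtained or how large the range should be.

```python
import numpy as np

def detection_window(prev_pt, predicted_shift, margin=20, img_shape=(480, 640)):
    """Predict where a feature should reappear and return a clamped ROI.

    prev_pt:         (x, y) position of the feature in the previous frame
    predicted_shift: (dx, dy) pixel displacement predicted from the IMU data
    margin:          half-size of the search window around the prediction
    """
    h, w = img_shape
    px = prev_pt[0] + predicted_shift[0]
    py = prev_pt[1] + predicted_shift[1]
    x0 = int(np.clip(px - margin, 0, w - 1))
    x1 = int(np.clip(px + margin, 0, w - 1))
    y0 = int(np.clip(py - margin, 0, h - 1))
    y1 = int(np.clip(py + margin, 0, h - 1))
    return x0, y0, x1, y1   # detect features only inside this rectangle
```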
3. The pose estimation method according to claim 1, wherein the matching of the feature line segments in the first target video to obtain first line feature information comprises:
matching feature line segments in any two adjacent frames of images of the first target video to obtain a plurality of line segment groups, wherein each line segment group comprises two matched feature line segments;
and deleting, from the plurality of line segment groups, the line segment groups meeting a preset condition, wherein a line segment group meets the preset condition when the position difference between its two feature line segments on the image is greater than a first preset value or the angle difference between them is greater than a second preset value, and the first line feature information comprises the remaining line segment groups.
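A minimal sketch of the filtering step in claim 3: matched line-segment pairs whose midpoint distance or orientation difference exceeds a threshold are discarded. The two thresholds and the midpoint/orientation measures are assumptions chosen for illustration; the claim only requires some position difference and angle difference to be compared against the first and second preset values.

```python
import numpy as np

def filter_line_groups(groups, max_pos_diff=30.0, max_angle_diff=np.deg2rad(10)):
    """Drop matched line-segment pairs whose position or angle differ too much.

    Each group is ((x1, y1, x2, y2), (x1, y1, x2, y2)) in pixel coordinates.
    """
    def midpoint(seg):
        return np.array([(seg[0] + seg[2]) / 2.0, (seg[1] + seg[3]) / 2.0])

    def angle(seg):
        return np.arctan2(seg[3] - seg[1], seg[2] - seg[0])

    kept = []
    for a, b in groups:
        pos_diff = np.linalg.norm(midpoint(a) - midpoint(b))
        ang_diff = abs(angle(a) - angle(b)) % np.pi
        ang_diff = min(ang_diff, np.pi - ang_diff)   # line segments are undirected
        if pos_diff <= max_pos_diff and ang_diff <= max_angle_diff:
            kept.append((a, b))
    return kept
```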
4. The pose estimation method according to claim 1, wherein after estimating the pose of the robot from the first inertial navigation data, the first point feature information, and the first line feature information, the method further comprises:
when a second target video is acquired, acquiring second inertial navigation data in a time period corresponding to the second target video, wherein the second target video is a video acquired within a first preset time after the first target video is acquired;
performing pre-integration processing on the second inertial navigation data to obtain pre-integration data;
acquiring second point feature information and second line feature information of the second target video;
and performing local optimization processing on the pose of the robot according to the pre-integration data, the second point feature information and the second line feature information.
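Claim 4 relies on pre-integrating the inertial measurements over the window of the second target video. The sketch below shows one standard discrete pre-integration of gyroscope and accelerometer samples using SciPy rotations; bias and noise propagation, which a complete implementation would carry along, are omitted, and the sample format is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(imu_samples):
    """Discrete IMU pre-integration between two camera frames.

    imu_samples: iterable of (gyro[3], accel[3], dt) tuples in the body frame.
    Returns the pre-integrated rotation, velocity and position increments.
    """
    dR = Rotation.identity()
    dv = np.zeros(3)
    dp = np.zeros(3)
    for gyro, accel, dt in imu_samples:
        accel = np.asarray(accel, dtype=float)
        # Position uses the velocity accumulated so far, then velocity and
        # rotation are advanced with the current sample.
        dp += dv * dt + 0.5 * dR.apply(accel) * dt * dt
        dv += dR.apply(accel) * dt
        dR = dR * Rotation.from_rotvec(np.asarray(gyro, dtype=float) * dt)
    return dR, dv, dp
```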
5. The pose estimation method according to claim 1, wherein after acquiring the first inertial navigation data within the time period corresponding to the first target video, the method further comprises:
acquiring first point cloud data in a time period corresponding to the first target video, wherein the first point cloud data is acquired through a radar installed on the robot, and the first point cloud data comprises a plurality of scanning points and a depth value of each scanning point;
performing association processing on the scanning points in the first point cloud data and the feature points in the first target video to obtain depth values of the feature points in the first target video;
and estimating the pose of the robot according to the depth values of the feature points in the first target video, the first inertial navigation data, the first point feature information and the first line feature information.
6. The pose estimation method according to claim 5, wherein the associating processing of the scan points in the first point cloud data and the feature points in the first target video to obtain depth values of the feature points in the first target video includes:
respectively mapping the scanning points in the first point cloud data and the feature points in the first target video to a preset plane;
acquiring N second mapping points within a preset range of a first mapping point in the preset plane, wherein the first mapping point is a mapping point of a characteristic point in the first target video in the preset plane, the second mapping point is a mapping point of a scanning point in the first point cloud data in the preset plane, and N is a positive integer;
and if the position difference among the N second mapping points is smaller than a third preset value, determining the average depth value of the N second mapping points as the depth value of the first mapping point.
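Claims 5 and 6 attach a depth to each visual feature by projecting the radar scan points onto the same plane, gathering the N projected scan points near the feature, and averaging their depths when those neighbors agree. The following is a sketch of that neighborhood check, assuming the projection onto the preset plane has already been done; the search radius, N, and spread threshold are arbitrary illustrative values.

```python
import numpy as np

def depth_from_neighbors(feature_px, scan_px, scan_depths,
                         search_radius=5.0, n_neighbors=3, max_spread=3.0):
    """Assign a depth to a feature point from nearby projected scan points.

    feature_px:  (x, y) feature position on the shared projection plane
    scan_px:     (M, 2) projected scan-point positions on the same plane
    scan_depths: (M,)   depth value of each scan point
    Returns the mean neighbor depth, or None if no consistent neighbors exist.
    """
    dists = np.linalg.norm(scan_px - np.asarray(feature_px, dtype=float), axis=1)
    nearby = np.argsort(dists)[:n_neighbors]
    if dists[nearby].max() > search_radius:
        return None                       # not enough scan points close by
    # Only accept the depth if the neighbors lie close to each other.
    spread = np.ptp(scan_px[nearby], axis=0).max()
    if spread > max_spread:
        return None
    return float(scan_depths[nearby].mean())
```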
7. The pose estimation method according to claim 5, wherein after estimating the pose of the robot from the depth values of the feature points in the first target video, the first inertial navigation data, the first point feature information, and the first line feature information, the method further comprises:
when a third target video is obtained, performing visual loop detection on the third target video to obtain a detection result, wherein the third target video is obtained within a second preset time after the first target video is obtained;
and if the detection result shows that a loop occurs, performing global optimization processing on the pose of the robot according to the first target video and the third target video.
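Claim 7 leaves the visual loop-detection method open. As one possible stand-in, the sketch below declares a loop between two frames when enough mutually consistent ORB descriptor matches survive a ratio test; the thresholds are assumptions, and the subsequent global optimization is not shown, since the claim only states that it is performed when a loop is found.

```python
import cv2

def detect_loop(frame_a, frame_b, min_matches=50, ratio=0.75):
    """Crude visual loop check: enough mutually consistent ORB matches."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    # Lowe-style ratio test to keep only distinctive matches.
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) >= min_matches
```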
8. A pose estimation apparatus applied to a robot, the apparatus comprising:
a feature detection unit, which is used for detecting feature points and feature line segments in a first target video, wherein the first target video is acquired through a shooting device installed on the robot;
a feature tracking unit, which is used for tracking the feature points in the first target video to obtain first point feature information;
a feature matching unit, which is used for matching the feature line segments in the first target video to obtain first line feature information;
a data acquisition unit, which is used for acquiring first inertial navigation data in a time period corresponding to the first target video, wherein the first inertial navigation data is acquired through an inertial sensor arranged on the robot;
and a first estimation unit, which is used for estimating the pose of the robot according to the first inertial navigation data, the first point feature information and the first line feature information.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111536038.7A 2021-12-15 2021-12-15 Pose estimation method and device, terminal equipment and computer readable storage medium Pending CN114359338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111536038.7A CN114359338A (en) 2021-12-15 2021-12-15 Pose estimation method and device, terminal equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111536038.7A CN114359338A (en) 2021-12-15 2021-12-15 Pose estimation method and device, terminal equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114359338A true CN114359338A (en) 2022-04-15

Family

ID=81100328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111536038.7A Pending CN114359338A (en) 2021-12-15 2021-12-15 Pose estimation method and device, terminal equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114359338A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311010A (en) * 2023-03-06 2023-06-23 中国科学院空天信息创新研究院 Method and system for woodland resource investigation and carbon sink metering

Similar Documents

Publication Publication Date Title
CN110322500B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
US10948297B2 (en) Simultaneous location and mapping (SLAM) using dual event cameras
US11002840B2 (en) Multi-sensor calibration method, multi-sensor calibration device, computer device, medium and vehicle
US9177384B2 (en) Sequential rolling bundle adjustment
CN109752003B (en) Robot vision inertia point-line characteristic positioning method and device
CN110349212B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
CN113029128B (en) Visual navigation method and related device, mobile terminal and storage medium
CN110081881A (en) It is a kind of based on unmanned plane multi-sensor information fusion technology warship bootstrap technique
CN110827321B (en) Multi-camera collaborative active target tracking method based on three-dimensional information
CN111623773B (en) Target positioning method and device based on fisheye vision and inertial measurement
CN113066127A (en) Visual inertial odometer method and system for calibrating equipment parameters on line
CN114217665A (en) Camera and laser radar time synchronization method, device and storage medium
CN114359338A (en) Pose estimation method and device, terminal equipment and computer readable storage medium
CN112150550B (en) Fusion positioning method and device
CN116894876A (en) 6-DOF positioning method based on real-time image
CN111693051A (en) Multi-target data association method based on photoelectric sensor
CN114037977B (en) Road vanishing point detection method, device, equipment and storage medium
CN113126117B (en) Method for determining absolute scale of SFM map and electronic equipment
CN115236643A (en) Sensor calibration method, system, device, electronic equipment and medium
CN111223139B (en) Target positioning method and terminal equipment
CN114119885A (en) Image feature point matching method, device and system and map construction method and system
CN112598736A (en) Map construction based visual positioning method and device
CN117495900B (en) Multi-target visual tracking method based on camera motion trend estimation
CN117593650B (en) Moving point filtering vision SLAM method based on 4D millimeter wave radar and SAM image segmentation
WO2022179047A1 (en) State information estimation method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination