CN114185073A - Pose display method, device and system - Google Patents


Publication number
CN114185073A
CN114185073A (application CN202111350621.9A)
Authority
CN
China
Prior art keywords
positioning
dimensional visual
image
target
visual map
Prior art date
Legal status
Pending
Application number
CN202111350621.9A
Other languages
Chinese (zh)
Inventor
李佳宁
李�杰
毛慧
浦世亮
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202111350621.9A priority Critical patent/CN114185073A/en
Publication of CN114185073A publication Critical patent/CN114185073A/en
Priority to PCT/CN2022/131134 priority patent/WO2023083256A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S19/00Satellite radio beacon positioning systems; Determining position, velocity or attitude using signals transmitted by such systems
    • G01S19/38Determining a navigation solution using signals transmitted by a satellite radio beacon positioning system
    • G01S19/39Determining a navigation solution using signals transmitted by a satellite radio beacon positioning system the satellite radio beacon positioning system transmitting time-stamped messages, e.g. GPS [Global Positioning System], GLONASS [Global Orbiting Navigation Satellite System] or GALILEO
    • G01S19/42Determining position
    • G01S19/45Determining position by combining measurements of signals from the satellite radio beacon positioning system with a supplementary measurement
    • G01S19/47Determining position by combining measurements of signals from the satellite radio beacon positioning system with a supplementary measurement the supplementary measurement being an inertial measurement, e.g. tightly coupled inertial
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C3/00Measuring distances in line of sight; Optical rangefinders

Abstract

The application provides a pose display method, device and system. In the method, a terminal device acquires a target image of a target scene and motion data of the terminal device, and determines a self-positioning track based on the target image and the motion data; it selects part of the multi-frame images as images to be detected and sends the images to be detected and the self-positioning track to a server. The server generates a fused positioning track, comprising a plurality of fused positioning poses, based on the images to be detected and the self-positioning track; for each fused positioning pose, the server determines a corresponding target positioning pose and displays it. The scheme achieves high-frame-rate, high-precision positioning while the terminal device sends only the self-positioning track and the images to be detected, reducing the amount of data transmitted over the network as well as the computing and storage resources consumed on the terminal device.

Description

Pose display method, device and system
Technical Field
The application relates to the field of computer vision, in particular to a pose display method, a pose display device and a pose display system.
Background
The GPS (Global Positioning System) is a high-precision radio navigation and positioning system based on artificial earth satellites that can provide accurate geographic position, velocity and time information anywhere on earth and in near-earth space. The Beidou satellite navigation system consists of a space segment, a ground segment and a user segment; it can provide high-precision, high-reliability positioning, navigation and timing services to users worldwide around the clock, and also offers regional navigation, positioning and timing capabilities.
Because terminal devices are equipped with GPS or Beidou receivers, GPS or the Beidou satellite navigation system can be used to position a terminal device when positioning is needed. In outdoor environments, GPS or Beidou signals are usually good, so the terminal device can be positioned accurately. In indoor environments, however, GPS or Beidou signals are poor, and the terminal device cannot be positioned accurately. For example, in energy industries such as coal, electric power and petrochemicals, positioning needs are growing, and these needs typically arise in indoor environments, where signal occlusion and similar problems prevent accurate positioning of the terminal device.
Disclosure of Invention
The application provides a pose display method, applied to a cloud-edge management system, wherein the cloud-edge management system comprises a terminal device and a server, the server comprises a three-dimensional visual map of a target scene, and the method comprises the following steps:
while moving through the target scene, the terminal device acquires a target image of the target scene and motion data of the terminal device, and determines a self-positioning track of the terminal device based on the target image and the motion data; if the target image comprises multiple frames of images, it selects a partial set of those images as images to be detected and sends the images to be detected and the self-positioning track to the server;
the server generates a fused positioning track of the terminal equipment in the three-dimensional visual map based on the image to be detected and the self-positioning track, wherein the fused positioning track comprises a plurality of fused positioning poses;
and aiming at each fusion positioning pose in the fusion positioning track, the server determines a target positioning pose corresponding to the fusion positioning pose and displays the target positioning pose.
The application provides a cloud-edge management system comprising a terminal device and a server, the server comprising a three-dimensional visual map of a target scene, wherein:
the terminal device is used for acquiring a target image of a target scene and motion data of the terminal device in the moving process of the target scene, and determining a self-positioning track of the terminal device based on the target image and the motion data; if the target image comprises a plurality of frames of images, selecting a partial image from the plurality of frames of images as an image to be detected, and sending the image to be detected and the self-positioning track to a server;
the server is used for generating a fused positioning track of the terminal equipment in the three-dimensional visual map based on the image to be detected and the self-positioning track, and the fused positioning track comprises a plurality of fused positioning poses; and aiming at each fusion positioning pose in the fusion positioning track, determining a target positioning pose corresponding to the fusion positioning pose, and displaying the target positioning pose.
The application provides a pose display apparatus, applied to the server in a cloud-edge management system, the server comprising a three-dimensional visual map of a target scene, the apparatus comprising:
the acquisition module is used for acquiring an image to be detected and a self-positioning track; the self-positioning track is determined by terminal equipment based on a target image of the target scene and motion data of the terminal equipment, and the image to be detected is a partial image in a multi-frame image included in the target image;
a generating module, configured to generate a fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, where the fused positioning track includes multiple fused positioning poses;
and the display module is used for determining a target positioning pose corresponding to each fusion positioning pose in the fusion positioning track and displaying the target positioning pose.
According to the technical scheme, a cloud-edge combined positioning and display method is provided. The terminal device at the edge collects target images and motion data and performs high-frame-rate self-positioning from them, yielding a high-frame-rate self-positioning track. The cloud server receives the images to be detected and the self-positioning track sent by the terminal device and derives a high-frame-rate fused positioning track in the three-dimensional visual map, realizing a high-frame-rate, high-precision positioning function; it is a vision-based indoor positioning approach that is accurate, low-cost and easy to deploy, and the fused positioning track can be displayed. In this scheme, the terminal device computes the high-frame-rate self-positioning track and sends only that track plus a small number of images to be detected, which reduces the amount of data transmitted over the network; global positioning is performed on the server, which reduces the computing and storage resources consumed on the terminal device. The scheme can be applied in energy industries such as coal, electric power and petrochemicals to locate personnel (such as workers and inspectors) indoors, so that their position information is obtained quickly and their safety is safeguarded.
Drawings
Fig. 1 is a schematic flowchart of a pose display method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a cloud edge management system according to an embodiment of the present application;
FIG. 3 is a schematic flow chart for determining a self-positioning trajectory according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a process for determining a global localization track according to an embodiment of the present application;
FIG. 5 is a schematic illustration of a self-localizing track, a global localizing track, and a fused localizing track;
FIG. 6 is a schematic flow chart illustrating a process for determining a fused localization track according to an embodiment of the present application;
fig. 7 is a schematic structural view of a pose display apparatus in an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to a determination".
The embodiment of the application provides a pose display method that can be applied to a cloud-edge management system. The cloud-edge management system may include a terminal device (i.e., a terminal device at the edge) and a server (i.e., a server in the cloud), and the server may include a three-dimensional visual map of a target scene (e.g., an indoor environment, an outdoor environment, etc.). Fig. 1 shows a flowchart of the pose display method, which may include the following steps:
step 101, in the moving process of a target scene, the terminal device obtains a target image of the target scene and motion data of the terminal device, and determines a self-positioning track of the terminal device based on the target image and the motion data.
Illustratively, if the target image comprises multiple frames of images, the terminal device traverses each frame as the current frame image; determines the self-positioning pose corresponding to the current frame image based on the self-positioning poses of the K frames preceding the current frame image, the map positions of the terminal device in the self-positioning coordinate system, and the motion data; and generates the self-positioning track of the terminal device in the self-positioning coordinate system based on the self-positioning poses corresponding to the multiple frames of images.
For example, if the current frame image is a key image, the map position in the self-localization coordinate system may be generated based on the current position of the terminal device (i.e., the position corresponding to the current frame image). If the current frame image is a non-key image, the map position in the self-positioning coordinate system does not need to be generated based on the current position of the terminal equipment.
And if the number of the matched feature points between the current frame image and the previous frame image of the current frame image does not reach a preset threshold value, determining that the current frame image is a key image. And if the number of the matched feature points between the current frame image and the previous frame image of the current frame image reaches a preset threshold value, determining that the current frame image is a non-key image.
Step 102, if the target image comprises a plurality of frames of images, the terminal equipment selects a part of the images from the plurality of frames of images as an image to be detected, and sends the image to be detected and the self-positioning track to a server.
For example, the terminal device may select M frames of images from the multiple frames of images as the image to be measured, where M may be a positive integer, such as 1, 2, 3, and the like. Obviously, the terminal device sends the server a part of the images to be detected in the multi-frame images, so that the data volume of network transmission can be reduced, and the network bandwidth resources are saved.
Step 103, the server generates a fusion positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, wherein the fusion positioning track can comprise a plurality of fusion positioning poses.
For example, the server may determine target map points corresponding to the image to be detected from the three-dimensional visual map of the target scene and determine a global positioning track of the terminal device in the three-dimensional visual map based on the target map points. The server then generates a fusion positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track. The frame rate of the fusion positioning poses in the fusion positioning track may be greater than the frame rate of the global positioning poses in the global positioning track; in other words, the fusion positioning track is a high-frame-rate pose sequence in the three-dimensional visual map, the global positioning track is a low-frame-rate pose sequence in the three-dimensional visual map, and the number of fusion positioning poses is greater than the number of global positioning poses. Further, the frame rate of the fusion positioning track may equal the frame rate of the self-positioning track, which is itself a high-frame-rate pose sequence; in that case the number of fusion positioning poses equals the number of self-positioning poses.
In one possible embodiment, the three-dimensional visual map may include, but is not limited to, at least one of: the pose matrix corresponding to the sample image, the sample global descriptor corresponding to the sample image, the sample local descriptor corresponding to the characteristic point in the sample image and the map point information. The server determines a target map point corresponding to the image to be detected from a three-dimensional visual map of a target scene, and determines a global positioning track of the terminal device in the three-dimensional visual map based on the target map point, which may include but is not limited to: and aiming at each frame of image to be detected, selecting candidate sample images from the multi-frame sample images based on the similarity between the image to be detected and the multi-frame sample images corresponding to the three-dimensional visual map. Acquiring a plurality of characteristic points from the image to be detected; and for each feature point, determining a target map point corresponding to the feature point from a plurality of map points corresponding to the candidate sample image. And determining the global positioning pose in the three-dimensional visual map corresponding to the image to be detected based on the plurality of feature points and the target map points corresponding to the plurality of feature points. And generating a global positioning track of the terminal equipment in the three-dimensional visual map based on the global positioning poses corresponding to all the images to be detected.
The server selects a candidate sample image from the multiple frame sample images based on the similarity between the image to be detected and the multiple frame sample images corresponding to the three-dimensional visual map, and the selecting may include: determining a global descriptor to be detected corresponding to the image to be detected, and determining the distance between the global descriptor to be detected and a sample global descriptor corresponding to each frame of sample image corresponding to the three-dimensional visual map; the three-dimensional visual map at least comprises a sample global descriptor corresponding to each frame of sample image. Selecting candidate sample images from the multi-frame sample images based on the distance between the global descriptor to be detected and each sample global descriptor; the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is the minimum distance; or, the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is smaller than the distance threshold.
The server determines a global descriptor to be tested corresponding to the image to be tested, which may include but is not limited to: determining a bag-of-words vector corresponding to the image to be detected based on the trained dictionary model, and determining the bag-of-words vector as a global descriptor to be detected corresponding to the image to be detected; or inputting the image to be detected to the trained deep learning model to obtain a target vector corresponding to the image to be detected, and determining the target vector as a global descriptor to be detected corresponding to the image to be detected. Of course, the above is only an example of determining the global descriptor to be tested, and the method is not limited thereto.
The server determines a target map point corresponding to the feature point from a plurality of map points corresponding to the candidate sample image, which may include but is not limited to: and determining a local descriptor to be detected corresponding to the feature point, wherein the local descriptor to be detected is used for representing the feature vector of the image block where the feature point is located, and the image block can be located in the image to be detected. Determining the distance between the local descriptor to be tested and the sample local descriptor corresponding to each map point corresponding to the candidate sample image; wherein, the three-dimensional visual map at least comprises a sample local descriptor corresponding to each map point corresponding to the candidate sample image. Then, a target map point can be selected from a plurality of map points corresponding to the candidate sample image based on the distance between the local descriptor to be detected and each sample local descriptor; the distance between the local descriptor to be detected and the sample local descriptor corresponding to the target map point may be a minimum distance, and the minimum distance is smaller than a distance threshold.
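The two paragraphs above describe matching each feature point's local descriptor against the sample local descriptors of the candidate sample image's map points and then computing the global positioning pose from the resulting 2D-3D correspondences. The patent does not name a concrete matcher or pose solver; the sketch below is a minimal illustration under the assumption of brute-force nearest-neighbour matching plus a RANSAC PnP solve, and all function and variable names are chosen only for illustration.

```python
import cv2
import numpy as np

def match_and_localize(query_kps, query_descs, cand_map_pts, cand_map_descs,
                       K, dist_thresh=0.7):
    """Match feature points of one image to be detected against the map points
    of a candidate sample image, then estimate a global positioning pose.

    query_kps      : (N, 2) pixel coordinates of feature points
    query_descs    : (N, D) local descriptors to be detected
    cand_map_pts   : (M, 3) 3D positions of map points seen in the candidate image
    cand_map_descs : (M, D) sample local descriptors of those map points
    K              : (3, 3) camera intrinsic matrix
    """
    pts_2d, pts_3d = [], []
    for kp, desc in zip(query_kps, query_descs):
        # distance between the local descriptor to be detected and every
        # sample local descriptor of the candidate image
        dists = np.linalg.norm(cand_map_descs - desc, axis=1)
        best = int(np.argmin(dists))
        if dists[best] < dist_thresh:          # minimum distance below threshold
            pts_2d.append(kp)
            pts_3d.append(cand_map_pts[best])  # target map point
    if len(pts_3d) < 6:
        return None                            # too few 2D-3D correspondences
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts_3d, np.float32), np.asarray(pts_2d, np.float32), K, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                 # global positioning pose (R, t)
    return R, tvec
```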
The server generates a fused positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track, which may include but is not limited to: the server can select N self-positioning poses corresponding to the target time period from all the self-positioning poses included in the self-positioning track, and select P global positioning poses corresponding to the target time period from all the global positioning poses included in the global positioning track; wherein N is greater than P. And determining N fusion positioning poses corresponding to the N self-positioning poses based on the N self-positioning poses and the P global positioning poses, wherein the N self-positioning poses correspond to the N fusion positioning poses one by one. And generating a fusion positioning track of the terminal equipment in the three-dimensional visual map based on the N fusion positioning poses.
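How the N self-positioning poses and P global positioning poses are combined is left abstract here. One plausible realization, sketched below purely as an assumption, estimates a rigid alignment between the self-positioning coordinate system and the three-dimensional visual map from the P correspondences and then applies it to all N self-positioning positions, so the fused track keeps the self-positioning frame rate; the names and the position-only alignment are illustrative simplifications.

```python
import numpy as np

def align_self_to_global(self_positions, global_positions):
    """Estimate R, t mapping self-positioning coordinates into the
    three-dimensional visual map from P corresponding position pairs."""
    mu_s = self_positions.mean(axis=0)
    mu_g = global_positions.mean(axis=0)
    H = (self_positions - mu_s).T @ (global_positions - mu_g)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = mu_g - R @ mu_s
    return R, t

def fuse_track(self_positions, self_times, global_positions, global_times):
    """N self-positioning positions + P global positions -> N fused positions.
    self_times must be sorted in ascending order."""
    # sample the self-positioning track at the P global timestamps
    idx = np.clip(np.searchsorted(self_times, global_times), 0,
                  len(self_times) - 1)
    R, t = align_self_to_global(self_positions[idx], global_positions)
    # apply the alignment to every self-positioning pose -> fused track
    return (R @ self_positions.T).T + t
```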
After generating the fused positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track, the server may further select an initial fused positioning pose from the fused positioning track, and select an initial self-positioning pose corresponding to the initial fused positioning pose from the self-positioning track. And selecting a target self-positioning pose from the self-positioning track, and determining a target fusion positioning pose based on the initial fusion positioning pose, the initial self-positioning pose and the target self-positioning pose. And then, generating a new fusion positioning track based on the target fusion positioning pose and the fusion positioning track to replace the original fusion positioning track.
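A minimal reading of the preceding paragraph, shown only as a sketch with 4x4 homogeneous pose matrices assumed: take the relative motion between the initial self-positioning pose and the target self-positioning pose and replay it on top of the initial fused positioning pose to obtain the target fused positioning pose.

```python
import numpy as np

def extend_fused_track(T_fused_init, T_self_init, T_self_target):
    """Derive the target fused positioning pose from the initial fused pose and
    the relative motion measured by the self-positioning track.

    All arguments are 4x4 homogeneous pose matrices (rotation + translation)."""
    # relative motion of the terminal device in the self-positioning coordinate system
    T_rel = np.linalg.inv(T_self_init) @ T_self_target
    # replay that motion starting from the initial fused positioning pose
    return T_fused_init @ T_rel
```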
Step 104, for each fusion positioning pose in the fusion positioning track, the server determines a target positioning pose corresponding to the fusion positioning pose and displays the target positioning pose.
For example, the server may take the fusion positioning pose directly as the target positioning pose and display the target positioning pose in the three-dimensional visual map. Alternatively, the server converts the fusion positioning pose into a target positioning pose in a three-dimensional visualization map based on a target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map, and displays the target positioning pose through the three-dimensional visualization map.
For example, the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map may be determined in any of the following ways. First, for each of a plurality of calibration points, a coordinate pair corresponding to the calibration point may be determined, the coordinate pair including the position coordinates of the calibration point in the three-dimensional visual map and its position coordinates in the three-dimensional visualization map; the target transformation matrix is then determined from the coordinate pairs of the plurality of calibration points. Second, an initial transformation matrix may be acquired, position coordinates in the three-dimensional visual map are mapped to mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix, and whether the initial transformation matrix has converged is determined from the relation between the mapping coordinates and the actual coordinates in the three-dimensional visualization map; if so, the initial transformation matrix is taken as the target transformation matrix; if not, the initial transformation matrix is adjusted, the adjusted matrix is taken as the new initial transformation matrix, and the mapping operation is repeated until the target transformation matrix is obtained. Third, the three-dimensional visual map may be sampled to obtain a first point cloud, the three-dimensional visualization map sampled to obtain a second point cloud, and the two point clouds registered with an ICP (Iterative Closest Point) algorithm to obtain the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map.
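Of the three options just listed, the calibration-point variant is the most compact to illustrate. The sketch below fits a similarity transform (rotation, translation, uniform scale) from corresponding calibration-point coordinates; whether scale is actually estimated, as well as the function and array names, are assumptions rather than the patent's prescription.

```python
import numpy as np

def target_transform_from_calibration_points(pts_visual, pts_display):
    """Fit a 4x4 target transformation matrix mapping calibration-point
    coordinates in the three-dimensional visual map (pts_visual, shape (K, 3))
    to their coordinates in the visualization map (pts_display, shape (K, 3))."""
    mu_v, mu_d = pts_visual.mean(axis=0), pts_display.mean(axis=0)
    Pv, Pd = pts_visual - mu_v, pts_display - mu_d
    # Umeyama-style closed-form fit from the cross-covariance of the point pairs
    U, S, Vt = np.linalg.svd(Pd.T @ Pv / len(pts_visual))
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:
        D[2, 2] = -1                      # avoid a reflection
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / (Pv ** 2).sum(axis=1).mean()
    T = np.eye(4)
    T[:3, :3] = scale * R
    T[:3, 3] = mu_d - scale * R @ mu_v    # translation part
    return T
```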
According to the technical scheme, a cloud-edge combined positioning and display method is provided. The terminal device at the edge collects target images and motion data and performs high-frame-rate self-positioning from them, yielding a high-frame-rate self-positioning track. The cloud server receives the images to be detected and the self-positioning track sent by the terminal device and derives a high-frame-rate fused positioning track in the three-dimensional visual map, realizing a high-frame-rate, high-precision positioning function; it is a vision-based indoor positioning approach that is accurate, low-cost and easy to deploy, and the fused positioning track can be displayed in the three-dimensional visual map. In this scheme, the terminal device computes the high-frame-rate self-positioning track and sends only that track plus a small number of images to be detected, which reduces the amount of data transmitted over the network; global positioning is performed on the server, which reduces the computing and storage resources consumed on the terminal device. The scheme can be applied in energy industries such as coal, electric power and petrochemicals to locate personnel (such as workers and inspectors) indoors, so that their position information is obtained quickly and their safety is safeguarded.
The following describes a pose display method according to an embodiment of the present application with reference to specific embodiments.
The embodiment of the application provides a cloud-edge combined visual positioning and display method. The target scene may be an indoor environment; that is, while the terminal device moves in the indoor environment, the server determines the fusion positioning track of the terminal device in the three-dimensional visual map, which amounts to a vision-based indoor positioning approach. Of course, the target scene may also be an outdoor environment, and no limitation is placed on this.
Referring to fig. 2, which shows a schematic structural diagram of the cloud-edge management system, the cloud-edge management system may include a terminal device (i.e., a terminal device at the edge) and a server (i.e., a server in the cloud); of course, it may further include other devices, such as a wireless base station and a router, without limitation. The server may include a three-dimensional visual map of the target scene and a corresponding three-dimensional visualization map, generate a fusion positioning track of the terminal device in the three-dimensional visual map, and display that track in the three-dimensional visualization map (the fusion positioning track needs to be converted into a track that can be displayed in the visualization map), so that a manager can view it in the three-dimensional visualization map through a web client.
The terminal device may include a visual sensor, a motion sensor, and the like, where the visual sensor may be a camera, and the visual sensor is configured to acquire an image of a target scene during movement of the terminal device, and for convenience of distinguishing, the image is recorded as a target image, and the target image includes multiple frames of images (i.e., multiple frames of real-time images during movement of the terminal device). The motion sensor may be, for example, an IMU (Inertial Measurement Unit), which is a Measurement device including a gyroscope and an accelerometer, and is used to acquire motion data of the terminal device, such as acceleration and angular velocity, during movement of the terminal device.
For example, the terminal device may be a wearable device (e.g., a video helmet, a smart watch, smart glasses, etc.), and the visual sensor and the motion sensor are disposed on the wearable device; or the terminal equipment is a recorder (for example, the terminal equipment is carried by a worker during work and has the functions of collecting video and audio in real time, taking pictures, recording, talkbacking, positioning and the like), and the visual sensor and the motion sensor are arranged on the recorder; alternatively, the terminal device is a camera (such as a split camera), and the vision sensor and the motion sensor are disposed on the camera. Of course, the above is only an example, and the type of the terminal device is not limited, for example, the terminal device may also be a smartphone, and the like, as long as a vision sensor and a motion sensor are deployed.
For example, the terminal device may acquire the target image and the motion data, perform high-frame-rate self-positioning according to the target image and the motion data, and obtain a high-frame-rate self-positioning trajectory (e.g., a 6DOF (six degrees of freedom) self-positioning trajectory), where the self-positioning trajectory may include multiple self-positioning poses, and since the self-positioning trajectory is a high-frame-rate self-positioning trajectory, the number of self-positioning poses in the self-positioning trajectory is large.
The terminal device can select a part of images from multi-frame images of the target image as images to be detected, and sends the self-positioning track with the high frame rate and the images to be detected to the server. The server can obtain a self-positioning track and an image to be detected, the server can perform global positioning at a low frame rate according to the image to be detected and a three-dimensional visual map of a target scene, and obtain a global positioning track (namely the global positioning track of the image to be detected in the three-dimensional visual map) at the low frame rate, the global positioning track can comprise a plurality of global positioning poses, and the global positioning track is the global positioning track at the low frame rate, so that the number of the global positioning poses in the global positioning track is small.
Based on the high-frame-rate self-positioning track and the low-frame-rate global positioning track, the server can fuse the high-frame-rate self-positioning track and the low-frame-rate global positioning track to obtain a high-frame-rate fusion positioning track, namely a high-frame-rate fusion positioning track in the three-dimensional visual map, so that a high-frame-rate global positioning result is obtained. The fusion positioning track can comprise a plurality of fusion positioning poses, and the fusion positioning track is a high-frame-rate fusion positioning track, so that the number of the fusion positioning poses in the fusion positioning track is large.
In the above embodiments, the pose (e.g., self-positioning pose, global positioning pose, fusion positioning pose, etc.) may be a position and a pose, and is generally represented by a rotation matrix and a translation vector, which is not limited to this.
In summary, in this embodiment, a globally unified, high-frame-rate visual positioning function can be implemented based on the target image and the motion data, yielding a high-frame-rate fused positioning track (e.g., 6DOF poses) in the three-dimensional visual map. It is a high-frame-rate, globally consistent positioning method that provides the terminal device with an indoor positioning function that is high in frame rate and precision, low in cost, and easy to deploy.
The above process of the embodiment of the present application is described in detail below with reference to specific application scenarios.
Firstly, self-positioning of the terminal equipment. The terminal device is an electronic device with a vision sensor and a motion sensor, and can acquire a target image (such as a continuous video image) of a target scene and motion data (such as IMU data) of the terminal device and determine a self-positioning track of the terminal device based on the target image and the motion data.
The target image may include multiple frames of images, and for each frame of image, the terminal device determines a self-positioning pose corresponding to the image, that is, the multiple frames of image correspond to multiple self-positioning poses, and the self-positioning trajectory of the terminal device may include multiple self-positioning poses, which may be understood as a set of multiple self-positioning poses.
The method comprises the steps that for a first frame image in a multi-frame image, the terminal equipment determines a self-positioning pose corresponding to the first frame image, for a second frame image in the multi-frame image, the terminal equipment determines a self-positioning pose corresponding to the second frame image, and the like. The self-positioning pose corresponding to the first frame image can be a coordinate origin of a reference coordinate system (namely, a self-positioning coordinate system), the self-positioning pose corresponding to the second frame image is a pose point in the reference coordinate system, namely, a pose point relative to the coordinate origin (namely, the self-positioning pose corresponding to the first frame image), the self-positioning pose corresponding to the third frame image is a pose point in the reference coordinate system, namely, a pose point relative to the coordinate origin, and so on, and the self-positioning poses corresponding to the frames of images are pose points in the reference coordinate system.
In summary, after obtaining the self-positioning poses corresponding to each frame of image, the self-positioning poses can be combined into a self-positioning track in the reference coordinate system, and the self-positioning track comprises the self-positioning poses.
In one possible embodiment, as shown in fig. 3, the self-localization trajectory is determined by the following steps:
step 301, acquiring a target image of a target scene and motion data of the terminal device.
Step 302, traversing the current frame image from the multiple frame images if the target image comprises the multiple frame images.
When the first frame image is traversed from the multiple frame images as the current frame image, the self-positioning pose corresponding to the first frame image may be a coordinate origin of a reference coordinate system (i.e., a self-positioning coordinate system), that is, the self-positioning pose coincides with the coordinate origin. When the second frame image is traversed from the multi-frame image as the current frame image, the self-positioning pose corresponding to the second frame image can be determined by adopting the subsequent steps. When a third frame image is traversed from the multi-frame image to serve as a current frame image, the self-positioning pose corresponding to the third frame image can be determined by adopting the subsequent steps, and by analogy, each frame image can be traversed to serve as the current frame image.
Step 303, calculating the feature point association between the current frame image and the previous frame image by using an optical flow algorithm. The optical flow algorithm finds correspondences between the current frame image and the previous frame image from the temporal change of pixel intensities and the correlation between adjacent frames, and thereby computes the motion of objects between the two images.
Step 304, determining whether the current frame image is a key image based on the number of matched feature points between the current frame image and the previous frame image. If the number of matched feature points does not reach a preset threshold, the two frames differ considerably and few feature points match, so the current frame image is determined to be a key image and step 305 is executed. If the number of matched feature points reaches the preset threshold, the two frames differ little and many feature points match, so the current frame image is determined to be a non-key image and step 306 is executed.
For example, a matching ratio between the current frame image and the previous frame image, such as the ratio of the number of matched feature points to the total number of feature points, may also be computed from the number of matched feature points. If the matching ratio does not reach a preset ratio, the current frame image is determined to be a key image; if it does, the current frame image is determined to be a non-key image (a code sketch of this check is given at the end of this subsection).
Step 305, if the current frame image is a key image, generating a map position in a self-positioning coordinate system (i.e. a reference coordinate system) based on the current position of the terminal device (i.e. the position where the current frame image is acquired by the terminal device), i.e. generating a new 3D map position. If the current frame image is a non-key image, the map position in the self-positioning coordinate system does not need to be generated based on the current position of the terminal equipment.
Step 306, determining the self-positioning pose corresponding to the current frame image based on the self-positioning poses corresponding to the K frames preceding the current frame image, the map positions of the terminal device in the self-positioning coordinate system, and the motion data of the terminal device; K may be a positive integer configured according to experience and is not limited.
For example, all motion data between the previous frame image and the current frame image may be pre-integrated to obtain an inertial measurement constraint between the two frames. Based on the self-positioning poses and motion data (such as velocity, acceleration and angular velocity) of the K frames preceding the current frame image (e.g., a sliding window), the map positions in the self-positioning coordinate system, and the inertial measurement constraint between the previous frame and the current frame, the self-positioning pose corresponding to the current frame image can be obtained and updated by joint optimization using bundle adjustment; the bundle adjustment process is not limited here.
For example, to keep the number of variables to be optimized bounded, a certain frame in the sliding window and part of the map positions may be marginalized, with their constraint information retained in the form of a prior.
For example, the terminal device may determine the self-positioning pose by using a VIO (Visual-Inertial Odometry) algorithm: the input of the VIO algorithm is the target image and the motion data, and its output is the self-positioning pose; for instance, the VIO algorithm may perform steps 301 to 306 to obtain the self-positioning pose. The VIO algorithm may include, but is not limited to, VINS (Visual-Inertial Navigation System), SVO (Semi-direct Visual Odometry), MSCKF (Multi-State Constraint Kalman Filter), and the like; any algorithm that yields the self-positioning pose may be used.
And 307, generating a self-positioning track of the terminal device in a self-positioning coordinate system based on self-positioning poses corresponding to the multi-frame images, wherein the self-positioning track comprises a plurality of self-positioning poses in the self-positioning coordinate system.
Therefore, the terminal device can obtain the self-positioning track in the self-positioning coordinate system, the self-positioning track can comprise self-positioning poses corresponding to multiple frames of images, obviously, the vision sensor can collect a large number of images, so that the terminal device can obtain the self-positioning poses corresponding to the images, namely, the self-positioning track can comprise a large number of self-positioning poses, namely, the terminal device can obtain the self-positioning track with a high frame rate.
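As a concrete illustration of the key-image check in steps 303-304 (referenced above), the sketch below tracks the previous frame's feature points into the current frame with pyramidal Lucas-Kanade optical flow and applies the matching-ratio test; the corner detector, the tracker and the threshold value are assumptions, not the patent's requirements.

```python
import cv2

def is_key_image(prev_gray, cur_gray, ratio_thresh=0.5, max_corners=200):
    """Return True when too few feature points of the previous frame can be
    matched in the current frame (steps 303-304 above)."""
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, max_corners,
                                       qualityLevel=0.01, minDistance=10)
    if prev_pts is None or len(prev_pts) == 0:
        return True
    # track the previous-frame feature points into the current frame
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray,
                                                  prev_pts, None)
    matched = int(status.sum())
    # matching ratio = matched feature points / total feature points
    return matched / len(prev_pts) < ratio_thresh
```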
And secondly, data transmission. If the target image comprises a plurality of frames of images, the terminal device can select a part of the images from the plurality of frames of images as an image to be detected and send the image to be detected and the self-positioning track to the server. For example, the terminal device sends the self-positioning track and the image to be measured to the server through a wireless network (e.g. 4G, 5G, Wifi, etc.), and the frame rate of the image to be measured is low, so that the occupied network bandwidth is small.
And thirdly, the three-dimensional visual map of the target scene. The three-dimensional visual map of the target scene needs to be constructed in advance and stored in the server, so that the server can perform global positioning based on it. The three-dimensional visual map is a way of storing image information of the target scene: multiple frames of sample images of the target scene are collected, and the three-dimensional visual map is constructed from these sample images, for example using visual mapping algorithms such as SFM (Structure From Motion) or SLAM (Simultaneous Localization And Mapping); the construction method is not limited.
After obtaining the three-dimensional visual map of the target scene, the three-dimensional visual map may include the following information:
pose of sample image: the sample image is a representative image when the three-dimensional visual map is constructed, that is, the three-dimensional visual map can be constructed based on the sample image, the pose matrix of the sample image (which may be referred to as sample image pose for short) can be stored in the three-dimensional visual map, and the three-dimensional visual map can include the pose of the sample image.
Sample global descriptor: for each frame of sample image, the sample image may correspond to an image global descriptor, and the image global descriptor is denoted as a sample global descriptor, where the sample global descriptor represents the sample image by using a high-dimensional vector, and the sample global descriptor is used to distinguish image features of different sample images.
For each frame of sample image, a bag-of-words vector corresponding to the sample image may be determined based on the trained dictionary model and taken as the sample global descriptor of that sample image. The BoW (Bag of Words) method is one way of determining a global descriptor: a bag-of-words vector, which is a vector representation used for image similarity detection, is constructed and used as the sample global descriptor corresponding to the sample image.
In the visual bag-of-words method, a "dictionary", also called a dictionary model, needs to be trained in advance, and generally, a classification tree is obtained by clustering feature point descriptors in a large number of images and training, each classification tree can represent a visual "word", and the visual "words" form the dictionary model.
For a sample image, all feature point descriptors in the sample image may be classified as words, and the occurrence frequency of all words is counted, so that the frequency of each word in a dictionary may form a vector, the vector is a bag-of-word vector corresponding to the sample image, the bag-of-word vector may be used to measure the similarity of two images, and the bag-of-word vector is used as a sample global descriptor corresponding to the sample image.
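As a concrete illustration of the bag-of-words pipeline just described (training the dictionary by clustering local descriptors, then histogramming word occurrences per image), here is a minimal sketch; the use of ORB descriptors, k-means clustering and the vocabulary size are assumptions rather than the patent's requirements.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def train_dictionary(train_images, num_words=256):
    """Cluster local descriptors from many images into visual 'words'."""
    orb = cv2.ORB_create()
    descs = []
    for img in train_images:
        _, d = orb.detectAndCompute(img, None)
        if d is not None:
            descs.append(d.astype(np.float32))
    return KMeans(n_clusters=num_words, n_init=10).fit(np.vstack(descs))

def bow_global_descriptor(image, dictionary):
    """Bag-of-words vector used as the global descriptor of one image."""
    orb = cv2.ORB_create()
    _, d = orb.detectAndCompute(image, None)
    if d is None:
        return np.zeros(dictionary.n_clusters, np.float32)
    words = dictionary.predict(d.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(dictionary.n_clusters + 1))
    hist = hist.astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-12)  # normalized word frequencies
```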
For each frame of sample image, the sample image may be input to a trained deep learning model to obtain a target vector corresponding to the sample image, and the target vector is determined as a sample global descriptor corresponding to the sample image. For example, a deep learning method is a method for determining a global descriptor, in the deep learning method, a sample image may be subjected to multilayer convolution through a deep learning model, and a high-dimensional target vector is finally obtained, and the target vector is used as the sample global descriptor corresponding to the sample image.
In the deep learning method, a deep learning model, such as a CNN (Convolutional Neural Networks) model, needs to be trained in advance, and the deep learning model is generally obtained by training a large number of images, and the training mode of the deep learning model is not limited. For a sample image, the sample image may be input to a deep learning model, the deep learning model processes the sample image to obtain a high-dimensional target vector, and the target vector is used as a sample global descriptor corresponding to the sample image.
Sample local descriptors corresponding to feature points of the sample image: for each frame of sample image, the sample image may include a plurality of feature points, where a feature point may be a specific pixel position in the sample image, the feature point may correspond to an image local descriptor, and the image local descriptor is recorded as a sample local descriptor, where the sample local descriptor describes features of image blocks in a range near the feature point (i.e., the pixel position) with a vector, and the vector may also be referred to as a descriptor of the feature point. In summary, the sample local descriptor is a feature vector for representing an image block where the feature point is located, and the image block may be located in the sample image. It should be noted that, for a feature point (i.e., a two-dimensional feature point) in a sample image, the feature point may correspond to a map point (i.e., a three-dimensional map point) in a three-dimensional visual map, and therefore, the sample local descriptor corresponding to the feature point may also be a sample local descriptor corresponding to the map point corresponding to the feature point.
Algorithms such as ORB (Oriented FAST and Rotated BRIEF), SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features) can be used to extract feature points from the sample image and determine the sample local descriptors corresponding to the feature points. A deep learning method (such as SuperPoint, DELF or D2-Net) may also be used to extract the feature points and determine the sample local descriptors; there is no limitation on this, as long as feature points can be obtained and their sample local descriptors determined.
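For instance, extracting feature points and their local descriptors with ORB in OpenCV (one of the options listed above) might look like the following sketch.

```python
import cv2

def extract_local_descriptors(image_bgr, max_features=1000):
    """Detect feature points and compute their local descriptors (here: ORB)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=max_features)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    # keypoints: distinctive pixel positions; descriptors: one vector per
    # feature point describing the surrounding image block
    return keypoints, descriptors
```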
Map point information: map point information may include, but is not limited to, the 3D spatial position of the map point, all sample images in which the map point is observed, and the indices of the corresponding 2D feature points (i.e., the feature points corresponding to the map point) in those images.
And fourthly, global positioning of the server. Based on the acquired three-dimensional visual map of the target scene, after the server obtains the image to be detected, the server determines a target map point corresponding to the image to be detected from the three-dimensional visual map of the target scene, and determines a global positioning track of the terminal equipment in the three-dimensional visual map based on the target map point.
For each frame of image to be detected, the server may determine a global positioning pose corresponding to the image to be detected, assuming that M frames of images to be detected exist, the M frames of images to be detected correspond to M global positioning poses, and a global positioning track of the terminal device in the three-dimensional visual map may include M global positioning poses, which may be understood as a set of M global positioning poses. And determining the global positioning pose corresponding to the first frame of image to be detected in the M frames of images to be detected, determining the global positioning pose corresponding to the second frame of image to be detected in the second frame of image to be detected, and so on. For each global positioning pose, the global positioning pose is a pose point in the three-dimensional visual map, i.e. a pose point in the coordinate system of the three-dimensional visual map. In summary, after obtaining the global positioning poses corresponding to the M frames of images to be detected, the global positioning poses are combined into a global positioning track in the three-dimensional visual map, and the global positioning track includes the global positioning poses.
Based on the three-dimensional visual map of the target scene, in one possible implementation, referring to fig. 4, the server may determine the global positioning track of the terminal device in the three-dimensional visual map by using the following steps:
step 401, the server obtains an image to be detected of a target scene from the terminal device.
For example, the terminal device may acquire a target image comprising multiple frames of images, select M frames from them as images to be detected, and send the M images to be detected to the server. For example, the multi-frame image includes key images and non-key images; on this basis, the terminal device may use the key images as the images to be detected and leave the non-key images out. As another example, the terminal device may select images to be detected from the multi-frame image at a fixed interval. Assuming the fixed interval is 5 (of course, the interval may be configured arbitrarily according to experience and is not limited), the 1st frame, the 6th frame (1+5), the 11th frame (6+5), and so on are selected as images to be detected, i.e., one image to be detected is selected every 5 frames.
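Either selection rule (key images only, or one frame per fixed interval) is straightforward; the sketch below shows the fixed-interval variant with the interval of 5 assumed in the example, using illustrative names.

```python
def select_images_to_detect(frames, interval=5):
    """Pick the 1st, 6th, 11th, ... frame (a step of `interval`)
    as the images to be detected sent to the server."""
    return frames[::interval]
```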
Step 402, determining a global descriptor to be detected corresponding to each frame of image to be detected.
For each frame of image to be detected, the image to be detected may correspond to an image global descriptor, which is recorded as the global descriptor to be detected; the global descriptor to be detected represents the image to be detected with a high-dimensional vector and is used to distinguish the image features of different images to be detected.
And determining a bag-of-words vector corresponding to each frame of image to be detected based on the trained dictionary model, and determining the bag-of-words vector as a global descriptor to be detected corresponding to the image to be detected. Or, for each frame of image to be detected, inputting the image to be detected to the trained deep learning model to obtain a target vector corresponding to the image to be detected, and determining the target vector as a global descriptor to be detected corresponding to the image to be detected.
In summary, the global descriptor to be detected corresponding to the image to be detected may be determined based on a visual bag-of-words method or a deep learning method, and the determination manner refers to the determination manner of the sample global descriptor, which is not described herein again.
Step 403, determining, for each frame of image to be detected, a similarity between the global descriptor to be detected corresponding to the image to be detected and the sample global descriptor corresponding to each frame of sample image corresponding to the three-dimensional visual map.
Referring to the above embodiment, the three-dimensional visual map may include a sample global descriptor corresponding to each frame of sample image, and therefore, a similarity between the global descriptor to be measured and each sample global descriptor may be determined, and taking the similarity as "distance similarity" as an example, a distance between the global descriptor to be measured and each sample global descriptor may be determined, such as a euclidean distance, that is, a euclidean distance between two feature vectors is calculated.
Step 404, selecting candidate sample images from the multi-frame sample images corresponding to the three-dimensional visual map based on the distance between the global descriptor to be detected and each sample global descriptor; the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is the minimum distance; or, the distance between the global descriptor to be tested and the sample global descriptor corresponding to the candidate sample image is smaller than the distance threshold.
For example, assuming that the three-dimensional visual map corresponds to the sample image 1, the sample image 2, and the sample image 3, the distance 1 between the global descriptor to be measured and the sample global descriptor corresponding to the sample image 1 may be calculated, the distance 2 between the global descriptor to be measured and the sample global descriptor corresponding to the sample image 2 may be calculated, and the distance 3 between the global descriptor to be measured and the sample global descriptor corresponding to the sample image 3 may be calculated.
In one possible embodiment, if the distance 1 is the minimum distance, the sample image 1 is selected as the candidate sample image. Alternatively, if the distance 1 is smaller than the distance threshold (which may be configured empirically), and the distance 2 is smaller than the distance threshold, but the distance 3 is not smaller than the distance threshold, then both the sample image 1 and the sample image 2 are selected as candidate sample images. Or, if the distance 1 is the minimum distance and the distance 1 is smaller than the distance threshold, the sample image 1 is selected as the candidate sample image, but if the distance 1 is the minimum distance and the distance 1 is not smaller than the distance threshold, the candidate sample image cannot be selected, that is, the relocation fails.
In summary, for each frame of the image to be measured, the candidate sample image corresponding to the image to be measured may be selected from the multiple frame sample images corresponding to the three-dimensional visual map, where the number of the candidate sample images is at least one.
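The retrieval of candidate sample images by global-descriptor distance could, for instance, be sketched as follows; the switch between the "minimum distance" and "below threshold" strategies, and all names and shapes, are illustrative assumptions rather than the exact server implementation.

```python
import numpy as np

def select_candidate_samples(query_desc, sample_descs, dist_threshold=None):
    """Select candidate sample images for one image to be detected.

    query_desc:   global descriptor of the image to be detected, shape (D,).
    sample_descs: (S, D) array, one sample global descriptor per sample image
                  stored with the three-dimensional visual map.
    Returns indices of candidate sample images: either the single nearest sample
    or, if `dist_threshold` is given, all samples closer than the threshold.
    """
    dists = np.linalg.norm(sample_descs - query_desc[None, :], axis=1)  # Euclidean distance
    if dist_threshold is None:
        return [int(dists.argmin())]
    idx = np.where(dists < dist_threshold)[0]
    return idx.tolist()  # an empty list corresponds to a relocation failure
```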
Step 405, for each frame of image to be detected, obtaining a plurality of feature points from the image to be detected, and for each feature point, determining a local descriptor to be detected corresponding to the feature point, where the local descriptor to be detected is used to represent a feature vector of an image block where the feature point is located, and the image block may be located in the image to be detected.
For example, the image to be measured may include a plurality of feature points; a feature point is a distinctive pixel position in the image to be measured, and each feature point may correspond to an image local descriptor, which is recorded as the local descriptor to be measured. The local descriptor to be measured describes, with a vector, the features of the image block in a range near the feature point (i.e., the pixel position), and this vector may also be referred to as the descriptor of the feature point. In summary, the local descriptor to be measured is a feature vector for representing the image block where the feature point is located.
The characteristic points can be extracted from the image to be detected by using algorithms such as ORB, SIFT, SURF and the like, and the local descriptors to be detected corresponding to the characteristic points are determined. A deep learning algorithm (such as SuperPoint, DELF, D2-Net, etc.) may also be used to extract feature points from the image to be detected and determine the local descriptor to be detected corresponding to the feature points, which is not limited to this, as long as the feature points can be obtained and the local descriptor to be detected can be determined.
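A possible sketch of this step using ORB (one of the algorithms named above) with OpenCV is given below; the parameter value is an assumption, and SIFT, SURF or a learned detector such as SuperPoint could be substituted.

```python
import cv2

def extract_features(image_bgr, n_features=1000):
    """Extract feature points and their local descriptors with ORB (OpenCV).

    The input is assumed to be a BGR image; the number of features is illustrative.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    # keypoints[i].pt is the 2D pixel position of a feature point;
    # descriptors[i] is the binary local descriptor of the patch around it.
    return keypoints, descriptors
```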
Step 406, determining, for each feature point corresponding to the image to be measured, a distance, such as a euclidean distance, between the local descriptor to be measured corresponding to the feature point and the sample local descriptor corresponding to each map point corresponding to the candidate sample image corresponding to the image to be measured (i.e., the sample local descriptor corresponding to the map point corresponding to each feature point in the candidate sample image), that is, calculating the euclidean distance between the two feature vectors.
Referring to the above embodiment, for each frame of sample image, the three-dimensional visual map includes the sample local descriptor corresponding to each map point corresponding to the sample image, and therefore, after the candidate sample image corresponding to the image to be tested is obtained, the sample local descriptor corresponding to each map point corresponding to the candidate sample image is obtained from the three-dimensional visual map. After each feature point corresponding to the image to be detected is obtained, the distance between the local descriptor to be detected corresponding to the feature point and the sample local descriptor corresponding to each map point corresponding to the candidate sample image is determined.
Step 407, for each feature point, selecting a target map point from a plurality of map points corresponding to the candidate sample image based on the distance between the local descriptor to be detected corresponding to the feature point and the sample local descriptor corresponding to each map point corresponding to the candidate sample image; and the distance between the local descriptor to be detected and the sample local descriptor corresponding to the target map point is the minimum distance, and the minimum distance is smaller than the distance threshold.
For example, assuming that the candidate sample image corresponds to a map point 1, a map point 2, and a map point 3, a distance 1 between the local descriptor to be measured corresponding to the feature point and the sample local descriptor corresponding to the map point 1 may be calculated, a distance 2 between the local descriptor to be measured and the sample local descriptor corresponding to the map point 2 may be calculated, and a distance 3 between the local descriptor to be measured and the sample local descriptor corresponding to the map point 3 may be calculated.
In one possible implementation, if the distance 1 is the minimum distance, the map point 1 may be selected as the target map point. Alternatively, if the distance 1 is less than the distance threshold (which may be configured empirically), and the distance 2 is less than the distance threshold, but the distance 3 is not less than the distance threshold, then both the map point 1 and the map point 2 may be selected as the target map point. Or, if the distance 1 is the minimum distance and the distance 1 is smaller than the distance threshold, the map point 1 may be selected as the target map point, but if the distance 1 is the minimum distance and the distance 1 is not smaller than the distance threshold, the target map point cannot be selected, that is, the relocation fails.
In summary, for each feature point of the image to be detected, a target map point corresponding to the feature point is selected from the candidate sample image corresponding to the image to be detected, so as to obtain a matching relationship between the feature point and the target map point.
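The per-feature-point matching against the map points of the candidate sample image might be sketched as follows, using the "minimum distance that is also below a threshold" variant described above; the names, shapes and use of Euclidean distance over float descriptors are assumptions for illustration.

```python
import numpy as np

def match_feature_to_map_point(feat_desc, map_point_descs, dist_threshold):
    """Match one feature point of the image to be detected to a target map point.

    feat_desc:       local descriptor to be detected of the feature point, shape (D,).
    map_point_descs: (M, D) sample local descriptors of the map points corresponding
                     to the candidate sample image.
    Returns the index of the target map point, or None if even the nearest map
    point is not closer than `dist_threshold` (i.e. no valid match).
    """
    dists = np.linalg.norm(map_point_descs - feat_desc[None, :], axis=1)
    best = int(dists.argmin())
    return best if dists[best] < dist_threshold else None
```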
Step 408, determining a global positioning pose in the three-dimensional visual map corresponding to the image to be detected based on the plurality of feature points corresponding to the image to be detected and the target map points corresponding to the plurality of feature points.
For a frame of image to be detected, the image to be detected may correspond to a plurality of feature points, and each feature point corresponds to a target map point, for example, the target map point corresponding to the feature point 1 is a map point 1, the target map point corresponding to the feature point 2 is a map point 2, and so on, so as to obtain a plurality of matching relationship pairs, each matching relationship pair includes a feature point (i.e., a two-dimensional feature point) and a map point (i.e., a three-dimensional map point in a three-dimensional visual map), the feature point represents a two-dimensional position in the image to be detected, and the map point represents a three-dimensional position in the three-dimensional visual map, that is, the matching relationship pair includes a mapping relationship from a two-dimensional position to a three-dimensional position, that is, a mapping relationship from a two-dimensional position in the image to be detected to a three-dimensional position in the three-dimensional visual map.
If the total number of the plurality of matching relationship pairs does not meet the number requirement, the global positioning pose in the three-dimensional visual map corresponding to the image to be detected cannot be determined based on the plurality of matching relationship pairs. If the total number of the plurality of matching relationship pairs meets the number requirement (that is, the total number reaches a preset number value), the global positioning pose in the three-dimensional visual map corresponding to the image to be detected can be determined based on the plurality of matching relationship pairs.
For example, a PnP (Perspective-n-Point) algorithm may be used to calculate the global positioning pose of the image to be detected in the three-dimensional visual map, and the calculation method is not limited. For example, the input data of the PnP algorithm is a plurality of matching relationship pairs; each matching relationship pair includes a two-dimensional position in the image to be detected and a three-dimensional position in the three-dimensional visual map, and the pose of the image to be detected in the three-dimensional visual map, that is, the global positioning pose, can be calculated by the PnP algorithm based on the plurality of matching relationship pairs.
In summary, for each frame of image to be detected, the global positioning pose in the three-dimensional visual map corresponding to the image to be detected is obtained, that is, the global positioning pose of the image to be detected in the three-dimensional visual map coordinate system is obtained.
In a possible implementation manner, after the plurality of matching relationship pairs are obtained, valid matching relationship pairs may be found from the plurality of matching relationship pairs. Based on the valid matching relationship pairs, the global positioning pose of the image to be detected in the three-dimensional visual map can be calculated by the PnP algorithm. For example, a RANSAC (RANdom SAmple Consensus) detection algorithm may be adopted to find the valid matching relationship pairs from all the matching relationship pairs, and this process is not limited.
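Below is a hedged sketch that combines the number-requirement check, RANSAC-based selection of valid matches and PnP via OpenCV; the camera intrinsic matrix, the minimum-pair value and the undistorted-pixel assumption are inputs assumed here for illustration and are not specified in the text above.

```python
import cv2
import numpy as np

def global_localization_pose(points_2d, points_3d, camera_matrix, min_pairs=10):
    """Estimate the global positioning pose from 2D-3D matching relationship pairs.

    points_2d:     (N, 2) feature point positions in the image to be detected.
    points_3d:     (N, 3) matched target map points in the three-dimensional visual map.
    camera_matrix: 3x3 intrinsic matrix of the terminal device's camera (assumed known).
    `min_pairs` stands in for the "number requirement" on matching pairs.
    Returns a 4x4 camera pose in the visual-map frame, or None if localization fails.
    """
    if len(points_2d) < min_pairs:
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, np.float64), np.asarray(points_2d, np.float64),
        camera_matrix, None)          # None: undistorted pixel coordinates assumed
    if not ok or inliers is None:
        return None
    R, _ = cv2.Rodrigues(rvec)        # map-to-camera rotation
    T_cw = np.eye(4); T_cw[:3, :3] = R; T_cw[:3, 3] = tvec.ravel()
    return np.linalg.inv(T_cw)        # camera pose in the visual-map coordinate system
```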
Step 409, generating a global positioning track of the terminal equipment in the three-dimensional visual map based on the global positioning poses corresponding to the M frames of images to be detected, wherein the global positioning track comprises a plurality of global positioning poses in the three-dimensional visual map. The server may obtain a global positioning track in the three-dimensional visual map, that is, a global positioning track in a coordinate system of the three-dimensional visual map, where the global positioning track may include global positioning poses corresponding to M frames of images to be detected, that is, the global positioning track may include M global positioning poses. Because the M frames of images to be detected are partial images selected from all the images, the global positioning track can include global positioning poses corresponding to a small number of images to be detected, that is, the server can obtain the global positioning track with a low frame rate.
Fifthly, fusion positioning of the server. After obtaining the self-positioning track with the high frame rate and the global positioning track with the low frame rate, the server fuses the self-positioning track with the high frame rate and the global positioning track with the low frame rate to obtain a fused positioning track with the high frame rate in a three-dimensional visual map coordinate system, namely a fused positioning track of the terminal equipment in the three-dimensional visual map. The fusion positioning track is a high frame rate pose in the three-dimensional visual map, the global positioning track is a low frame rate pose in the three-dimensional visual map, namely the frame rate of the fusion positioning track is higher than that of the global positioning track, and the number of the fusion positioning poses is larger than that of the global positioning poses.
Referring to fig. 5, a white solid circle represents a self-positioning pose, and a track formed by a plurality of self-positioning poses is referred to as a self-positioning track, that is, the self-positioning track includes a plurality of self-positioning poses. The self-positioning pose corresponding to the first frame image may be taken as the coordinate origin of the reference coordinate system $S_L$ (the self-positioning coordinate system); this self-positioning pose is recorded as $T^{L}_{1}$, and the self-positioning pose $T^{L}_{1}$ coincides with the reference coordinate system $S_L$. Each self-positioning pose in the self-positioning track is a self-positioning pose in the reference coordinate system $S_L$.

A gray solid circle represents a global positioning pose, and a track formed by a plurality of global positioning poses is called a global positioning track, that is, the global positioning track includes a plurality of global positioning poses, which are poses in the three-dimensional visual map coordinate system $S_G$; that is, each global positioning pose in the global positioning track is a pose in the three-dimensional visual map coordinate system $S_G$, and is also a global positioning pose in the three-dimensional visual map.

A white dotted circle represents a fusion positioning pose, and a track formed by a plurality of fusion positioning poses is called a fusion positioning track, that is, the fusion positioning track includes a plurality of fusion positioning poses, which are poses in the three-dimensional visual map coordinate system $S_G$; that is, each fusion positioning pose in the fusion positioning track is a pose in the three-dimensional visual map coordinate system $S_G$, and is also a fusion positioning pose in the three-dimensional visual map.
Referring to fig. 5, the target image includes multiple frames of images, each frame of image corresponds to a self-positioning pose, and a partial image is selected from the multiple frames of images as an image to be detected, and each frame of image to be detected corresponds to a global positioning pose, so that the number of self-positioning poses is greater than the number of global positioning poses. When the fusion positioning tracks are obtained based on the self-positioning tracks and the global positioning tracks, each self-positioning pose corresponds to one fusion positioning pose (namely, the self-positioning poses correspond to the fusion positioning poses one by one), namely, the number of the self-positioning poses is the same as that of the fusion positioning poses, and therefore, the number of the fusion positioning poses is larger than that of the global positioning poses.
In a possible implementation manner, the server may implement a track fusion function and a pose transformation function, as shown in fig. 6, the server may implement the track fusion function and the pose transformation function by the following steps to obtain a fusion positioning track of the terminal device in the three-dimensional visual map:
step 601, selecting N self-positioning poses corresponding to the target time period from all the self-positioning poses included in the self-positioning track, and selecting P global positioning poses corresponding to the target time period from all the global positioning poses included in the global positioning track, wherein N may be greater than P for example.
For example, when fusing the self-positioning track and the global positioning track of a target time period, the N self-positioning poses corresponding to the target time period (i.e., the self-positioning poses determined based on the images acquired in the target time period) may be determined, and the P global positioning poses corresponding to the target time period (i.e., the global positioning poses determined based on the images acquired in the target time period) may be determined. As shown in fig. 5, the self-positioning poses between $T^{L}_{1}$ and $T^{L}_{N}$ may be taken as the N self-positioning poses corresponding to the target time period, and the global positioning poses between $T^{G}_{1}$ and $T^{G}_{P}$ may be taken as the P global positioning poses corresponding to the target time period.
Step 602, determining N fusion positioning poses corresponding to the N self-positioning poses based on the N self-positioning poses and the P global positioning poses, wherein the N self-positioning poses correspond to the N fusion positioning poses one by one.
For example, referring to fig. 5, based on the N self-positioning poses and the P global positioning poses, the fusion positioning pose $\hat{T}^{G}_{1}$ corresponding to the self-positioning pose $T^{L}_{1}$ may be determined, the fusion positioning pose $\hat{T}^{G}_{2}$ corresponding to the self-positioning pose $T^{L}_{2}$ may be determined, the fusion positioning pose $\hat{T}^{G}_{3}$ corresponding to the self-positioning pose $T^{L}_{3}$ may be determined, and so on.
In a possible implementation manner, it is assumed that there are N self-positioning poses, P global positioning poses, and N fusion positioning poses, where the N self-positioning poses are known values, the P global positioning poses are known values, and the N fusion positioning poses are unknown values, i.e. the pose values to be solved. As shown in fig. 5, the self-positioning pose $T^{L}_{1}$ corresponds to the fusion positioning pose $\hat{T}^{G}_{1}$, the self-positioning pose $T^{L}_{2}$ corresponds to the fusion positioning pose $\hat{T}^{G}_{2}$, the self-positioning pose $T^{L}_{3}$ corresponds to the fusion positioning pose $\hat{T}^{G}_{3}$, and so on. Each global positioning pose $T^{G}_{k}$ corresponds to the fusion positioning pose $\hat{T}^{G}_{k}$ of the frame for which that global positioning result was obtained, and so on.
A first constraint value may be determined based on the N self-positioning poses and the N fusion positioning poses, and the first constraint value is used to represent residual values between the fusion positioning poses and the self-positioning poses. For example, the first constraint value may be calculated based on the difference between $T^{L}_{1}$ and $\hat{T}^{G}_{1}$, the difference between $T^{L}_{2}$ and $\hat{T}^{G}_{2}$, ..., and the difference between $T^{L}_{N}$ and $\hat{T}^{G}_{N}$. The calculation formula of the first constraint value is not limited in this embodiment and may be related to the above differences.
A second constraint value may be determined based on the P global positioning poses and P fusion positioning poses (i.e., the P fusion positioning poses corresponding to the P global positioning poses, selected from the N fusion positioning poses), and the second constraint value is used to represent residual values (i.e., absolute differences) between the fusion positioning poses and the global positioning poses. For example, the second constraint value may be calculated based on the difference between $T^{G}_{1}$ and its corresponding fusion positioning pose, ..., and the difference between $T^{G}_{P}$ and its corresponding fusion positioning pose. The calculation formula of the second constraint value is not limited in this embodiment and may be related to the above differences.
The target constraint value may be calculated based on the first constraint value and the second constraint value; for example, the target constraint value may be the sum of the first constraint value and the second constraint value. Because the N self-positioning poses and the P global positioning poses are known values and the N fusion positioning poses are unknown values, the target constraint value is minimized by adjusting the values of the N fusion positioning poses. When the target constraint value is minimum, the values of the N fusion positioning poses are the finally solved pose values, so that the values of the N fusion positioning poses are obtained.
In one possible implementation, the target constraint value may be calculated using equation (1):
$F(T)=\sum_{i=1}^{N-1} e_{i,i+1}^{\top}\,\Omega_{i,i+1}\,e_{i,i+1}+\sum_{k=1}^{P} e_{k}^{\top}\,\Omega_{k}\,e_{k}$    (1)
In formula (1), F(T) represents the target constraint value; the part before the plus sign (hereinafter referred to as the first part) is the first constraint value, and the part after the plus sign (hereinafter referred to as the second part) is the second constraint value. $\Omega_{i,i+1}$ is the residual information matrix for the self-positioning poses and may be configured empirically without limitation, and $\Omega_{k}$ is the residual information matrix for the global positioning poses and may also be configured empirically without limitation. The first part represents the relative transformation constraint between the self-positioning poses and the fusion positioning poses and is reflected by the first constraint value, where N is the number of all self-positioning poses in the self-positioning track, i.e. the N self-positioning poses. The second part represents the global positioning constraint between the global positioning poses and the fusion positioning poses and is reflected by the second constraint value, where P is the number of all global positioning poses in the global positioning track, i.e. the P global positioning poses.
For the first part and the second part, it can also be expressed by formula (2) and formula (3):
$e_{i,i+1}=\Big(\big(T^{L}_{i}\big)^{-1}T^{L}_{i+1}\Big)^{-1}\Big(\big(\hat{T}^{G}_{i}\big)^{-1}\hat{T}^{G}_{i+1}\Big)$    (2)

$e_{k}=\big(T^{G}_{k}\big)^{-1}\hat{T}^{G}_{k}$    (3)

In formula (2) and formula (3), $\hat{T}^{G}_{i}$ and $\hat{T}^{G}_{i+1}$ are fusion positioning poses (without a corresponding global positioning pose), $T^{L}_{i}$ and $T^{L}_{i+1}$ are self-positioning poses, $\big(T^{L}_{i}\big)^{-1}T^{L}_{i+1}$ is the relative pose change constraint between the two self-positioning poses, and $e_{i,i+1}$ is the residual between the relative pose change of $\hat{T}^{G}_{i}$ and $\hat{T}^{G}_{i+1}$ and this constraint. $\hat{T}^{G}_{k}$ is a fusion positioning pose (with a corresponding global positioning pose $T^{G}_{k}$), $T^{G}_{k}$ is the global positioning pose corresponding to $\hat{T}^{G}_{k}$, and $e_{k}$ represents the residual of the fusion positioning pose $\hat{T}^{G}_{k}$ with respect to the global positioning pose $T^{G}_{k}$.
Because the self-positioning poses and the global positioning poses are known and the fusion positioning poses are unknown, the optimization goal may be to minimize the value of F(T), so that the fusion positioning poses, i.e. the fusion positioning track in the three-dimensional visual map coordinate system, can be obtained, as shown in formula (4): $T^{*}=\arg\min_{T}F(T)$ (4). The fusion positioning track may be obtained by minimizing the value of F(T), and the fusion positioning track may include a plurality of fusion positioning poses.
Exemplarily, in order to minimize the value of F(T), algorithms such as Gauss-Newton, gradient descent and LM (Levenberg-Marquardt) may be used to solve for the fusion positioning poses, which is not described herein again.
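As a simplified illustration of the optimization in formula (1), the sketch below fuses the two tracks using 3D positions only (rotations and the information matrices are omitted) and a generic nonlinear least-squares solver; it is a toy instance of the described idea under these stated simplifications, not the exact solver used by the server.

```python
import numpy as np
from scipy.optimize import least_squares

def fuse_trajectories(self_xyz, global_xyz, global_idx):
    """Sketch of the fusion step, simplified to positions only.

    self_xyz:   (N, 3) self-positioning positions (known, self-positioning frame).
    global_xyz: (P, 3) global positioning positions (known, visual-map frame).
    global_idx: length-P indices of the frames that have a global positioning result.
    The N fused positions are the unknowns; the residual stacks the relative-motion
    constraint (first part of formula (1)) and the global constraint (second part).
    """
    global_idx = np.asarray(global_idx)

    def residuals(x):
        fused = x.reshape(-1, 3)
        rel = (fused[1:] - fused[:-1]) - (self_xyz[1:] - self_xyz[:-1])  # e_{i,i+1}
        glob = fused[global_idx] - global_xyz                            # e_k
        return np.concatenate([rel.ravel(), glob.ravel()])

    x0 = self_xyz.copy().ravel()        # initialise from the self-positioning track
    sol = least_squares(residuals, x0)  # nonlinear least squares (Gauss-Newton / LM family)
    return sol.x.reshape(-1, 3)
```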
Step 603, generating a fusion positioning track of the terminal device in the three-dimensional visual map based on the N fusion positioning poses, wherein the fusion positioning track comprises the N fusion positioning poses in the three-dimensional visual map.
The server obtains a fusion positioning track in the three-dimensional visual map, namely the fusion positioning track in the coordinate system of the three-dimensional visual map, wherein the number of fusion positioning poses in the fusion positioning track is greater than that in the global positioning track, namely the fusion positioning track with a high frame rate can be obtained.
Step 604, selecting an initial fusion positioning pose from the fusion positioning track, and selecting an initial self-positioning pose corresponding to the initial fusion positioning pose from the self-positioning track.
Step 605, selecting a target self-positioning pose from the self-positioning track, and determining a target fusion positioning pose based on the initial fusion positioning pose, the initial self-positioning pose and the target self-positioning pose.
For example, after the fused positioning track is generated, the fused positioning track may be updated, and in the track updating process, an initial fused positioning pose may be selected from the fused positioning track, an initial self-positioning pose may be selected from the self-positioning track, and a target self-positioning pose may be selected from the self-positioning track. On this basis, a target fusion localization pose may be determined based on the initial fusion localization pose, the initial self-localization pose, and the target self-localization pose. A new fused localization track may then be generated based on the target fused localization pose and the fused localization track to replace the original fused localization track.
For example, through step 601 to step 603, referring to fig. 5, the self-positioning track includes $T^{L}_{1}$ to $T^{L}_{N}$, the global positioning track includes $T^{G}_{1}$ to $T^{G}_{P}$, and the fusion positioning track includes $\hat{T}^{G}_{1}$ to $\hat{T}^{G}_{N}$. After that, if a new self-positioning pose $T^{L}_{j}$ is obtained but there is no corresponding global positioning pose, the fusion positioning pose $\hat{T}^{G}_{j}$ corresponding to the self-positioning pose $T^{L}_{j}$ cannot be determined based on a global positioning pose and the self-positioning pose. On this basis, in this embodiment, the fusion positioning pose $\hat{T}^{G}_{j}$ may also be determined according to the following formula (5):

$\hat{T}^{G}_{j}=\hat{T}^{G}_{i}\,\big(T^{L}_{i}\big)^{-1}\,T^{L}_{j}$    (5)

In formula (5), $\hat{T}^{G}_{j}$ represents the fusion positioning pose corresponding to the self-positioning pose $T^{L}_{j}$, i.e. the target fusion positioning pose; $\hat{T}^{G}_{i}$ represents a fusion positioning pose, i.e. the initial fusion positioning pose selected from the fusion positioning track; $T^{L}_{i}$ represents the self-positioning pose corresponding to $\hat{T}^{G}_{i}$, i.e. the initial self-positioning pose selected from the self-positioning track; and $T^{L}_{j}$ represents the target self-positioning pose selected from the self-positioning track. In summary, the target fusion positioning pose $\hat{T}^{G}_{j}$ may be determined based on the initial fusion positioning pose $\hat{T}^{G}_{i}$, the initial self-positioning pose $T^{L}_{i}$ and the target self-positioning pose $T^{L}_{j}$. After the target fusion positioning pose $\hat{T}^{G}_{j}$ is obtained, a new fusion positioning track may be generated, i.e. the new fusion positioning track may include the target fusion positioning pose $\hat{T}^{G}_{j}$, thereby updating the fusion positioning track.
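The pose transformation of formula (5) can be sketched with homogeneous 4x4 matrices as follows; representing the poses as 4x4 matrices is an assumption made purely for illustration.

```python
import numpy as np

def propagate_fused_pose(T_fused_i, T_self_i, T_self_j):
    """Pose transformation for a frame without a global positioning result.

    All inputs are assumed to be 4x4 homogeneous pose matrices:
    T_fused_i: initial fusion positioning pose (visual-map frame),
    T_self_i:  initial self-positioning pose corresponding to it (self-positioning frame),
    T_self_j:  target self-positioning pose of the new frame.
    The relative motion measured by self-positioning is appended to the last
    known fusion positioning pose, as in formula (5).
    """
    return T_fused_i @ np.linalg.inv(T_self_i) @ T_self_j
```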
In the above process, step 601 to step 603 are the track fusion process, and step 604 to step 605 are the pose transformation process. Track fusion is a process of registering and fusing the self-positioning track and the global positioning track, so as to convert the self-positioning track from the self-positioning coordinate system to the three-dimensional visual map coordinate system and correct the track by using the global positioning result; track fusion is performed once whenever a new frame can obtain a global positioning pose. Because not every frame can successfully obtain a global positioning pose, the poses of the remaining frames are output as fusion positioning poses in the three-dimensional visual map coordinate system through pose transformation, that is, the pose transformation process.
And sixthly, the three-dimensional visualization map of the target scene. The three-dimensional visualization map of the target scene needs to be constructed in advance and is stored in the server, and the server can display the track based on the three-dimensional visualization map. The three-dimensional visualization map is a 3D visualization map of the target scene and is mainly used for track display; it can be obtained through laser scanning and manual modeling and is a visualization map that can be viewed. The construction manner of the three-dimensional visualization map is not limited; for example, it may also be obtained by using a mapping (composition) algorithm.
The three-dimensional visualization map and the three-dimensional visual map of the target scene need to be registered with each other to ensure that the two maps are aligned in space. For example, the three-dimensional visualization map is sampled so that it is changed from a triangular patch form to a dense point cloud form, and this point cloud is registered with the 3D point cloud of the three-dimensional visual map through an ICP (Iterative Closest Point) algorithm to obtain a transformation matrix T from the three-dimensional visual map to the three-dimensional visualization map; finally, the three-dimensional visual map is transformed into the three-dimensional visualization map coordinate system by using the transformation matrix T to obtain a three-dimensional visual map aligned with the three-dimensional visualization map.
For example, the transformation matrix T (denoted as target transformation matrix) may be determined as follows:
mode 1, when a three-dimensional visual map and a three-dimensional visual map are constructed, a plurality of calibration points (different calibration points can be distinguished through different shapes, so that the calibration points can be recognized from an image) can be deployed in a target scene, the three-dimensional visual map can comprise a plurality of calibration points, and the three-dimensional visual map can also comprise a plurality of calibration points. For each of a plurality of calibration points, a coordinate pair corresponding to the calibration point may be determined, the coordinate pair including a position coordinate of the calibration point in the three-dimensional visual map and a position coordinate of the calibration point in the three-dimensional visual map. The target transformation matrix can be determined based on the coordinate pairs corresponding to the plurality of calibration points. For example, the target transformation matrix T may be an m × n-dimensional transformation matrix, and the transformation relationship between the three-dimensional visual map and the three-dimensional visual map may be: w is Q × T, W represents the position coordinate in the three-dimensional visual map, and Q represents the position coordinate in the three-dimensional visual map, and then, a plurality of coordinate pairs corresponding to a plurality of calibration points are substituted into the above formula (i.e., the position coordinate of the calibration point in the three-dimensional visual map is taken as Q, and the position coordinate of the calibration point in the three-dimensional visual map is taken as W), so that a target transformation matrix T can be obtained, which is not described again.
Mode 2: an initial transformation matrix is acquired, position coordinates in the three-dimensional visual map are mapped into mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix, and whether the initial transformation matrix has converged is determined based on the relationship between the mapping coordinates and the actual coordinates in the three-dimensional visualization map. If so, the initial transformation matrix is determined as the target transformation matrix, so that the target transformation matrix is obtained; if not, the initial transformation matrix may be adjusted, the adjusted transformation matrix is taken as the initial transformation matrix, and the operation of mapping the position coordinates in the three-dimensional visual map into mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix is performed again, and so on, until the target transformation matrix is obtained.
For example, an initial transformation matrix may be obtained first, the obtaining method of the initial transformation matrix is not limited, and the initial transformation matrix may be an initial transformation matrix set randomly or an initial transformation matrix obtained by using a certain algorithm, where the initial transformation matrix is a matrix that needs iterative optimization, that is, the initial transformation matrix is continuously iteratively optimized, and the initial transformation matrix after iterative optimization is used as a target transformation matrix.
After the initial transformation matrix is obtained, the position coordinates in the three-dimensional visual map can be mapped to mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix. For example, the transformation relationship between the two maps may be W = Q × T; that is, the mapping coordinates in the three-dimensional visualization map can be obtained by taking the position coordinates in the three-dimensional visual map as Q and the initial transformation matrix as T (the resulting coordinates are referred to as mapping coordinates for ease of distinction). Then, whether the initial transformation matrix has converged is determined based on the relationship between the mapping coordinates and the actual coordinates in the three-dimensional visualization map. For example, the mapping coordinates are coordinates converted based on the initial transformation matrix, and the actual coordinates are the real coordinates in the three-dimensional visualization map; the smaller the difference between the mapping coordinates and the actual coordinates, the higher the accuracy of the initial transformation matrix, and the larger the difference, the lower the accuracy. Based on this principle, whether the initial transformation matrix has converged can be determined based on the difference between the mapping coordinates and the actual coordinates.
For example, if the difference between the mapped coordinates and the actual coordinates (which may be the sum of multiple sets of differences, each set of differences corresponds to the difference between one mapped coordinate and the actual coordinates) is smaller than a threshold, it is determined that the initial transformation matrix has converged, and if the difference between the mapped coordinates and the actual coordinates is not smaller than the threshold, it is determined that the initial transformation matrix has not converged.
If the initial transformation matrix has not converged, the initial transformation matrix may be adjusted; the adjustment process is not limited, and for example, the initial transformation matrix may be adjusted by using an ICP (Iterative Closest Point) algorithm. The adjusted transformation matrix is taken as the initial transformation matrix, the operation of mapping the position coordinates in the three-dimensional visual map into mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix is performed again, and so on, until the target transformation matrix is obtained. If the initial transformation matrix has converged, the initial transformation matrix is determined as the target transformation matrix.
Mode 3: the three-dimensional visualization map is sampled to obtain a first point cloud corresponding to the three-dimensional visualization map, and the three-dimensional visual map is sampled to obtain a second point cloud corresponding to the three-dimensional visual map. The first point cloud and the second point cloud are registered by using an ICP (Iterative Closest Point) algorithm to obtain the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map. Obviously, the first point cloud and the second point cloud can both be obtained, each including a large number of 3D points, and the registration can be performed by using the ICP algorithm based on the 3D points of the first point cloud and the 3D points of the second point cloud; the registration process is not limited.
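A minimal point-to-point ICP sketch for mode 3 is shown below; the SVD-based alignment step is only one common choice, and a production system would typically rely on an established ICP implementation rather than this toy loop.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, iters=30):
    """Minimal point-to-point ICP sketch for registering the two sampled point clouds.

    source: (N, 3) point cloud sampled from the three-dimensional visual map.
    target: (M, 3) point cloud sampled from the three-dimensional visualization map.
    Returns a 4x4 rigid transform mapping `source` onto `target`.
    """
    T = np.eye(4)
    src = source.copy()
    tree = cKDTree(target)
    for _ in range(iters):
        _, idx = tree.query(src)            # nearest-neighbour correspondences
        tgt = target[idx]
        mu_s, mu_t = src.mean(0), tgt.mean(0)
        H = (src - mu_s).T @ (tgt - mu_t)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:            # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t                 # apply the incremental alignment
        step = np.eye(4); step[:3, :3] = R; step[:3, 3] = t
        T = step @ T                        # accumulate the total transform
    return T
```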
And seventhly, displaying the track. After the server obtains the fusion positioning track, for each fusion positioning pose, the server can convert the fusion positioning pose into a target positioning pose in the three-dimensional visualization map based on the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map, and display the target positioning pose through the three-dimensional visualization map. On this basis, a manager can open a Web browser and access the server through the network to view the target positioning poses displayed in the three-dimensional visualization map, and these target positioning poses form a track. The server can display the target positioning poses of the terminal device on the three-dimensional visualization map by reading and rendering the three-dimensional visualization map, so that the manager can view the target positioning poses displayed in the three-dimensional visualization map. The manager can change the viewing angle by dragging the mouse, so that 3D viewing of the track is achieved. For example, the server includes client software, and the client software reads and renders the three-dimensional visualization map and displays the target positioning poses on the three-dimensional visualization map. On this basis, a user (such as an administrator) can access the client software through the Web browser to view the target positioning poses displayed in the three-dimensional visualization map through the client software. Exemplarily, when viewing the target positioning poses displayed in the three-dimensional visualization map through the client software, the viewing angle of the three-dimensional visualization map can be changed by dragging the mouse.
According to the technical scheme, the cloud-edge combined positioning and displaying method is provided, the terminal device calculates the self-positioning track with the high frame rate, only the self-positioning track and a small number of images to be detected are sent, and the data volume of network transmission is reduced. And global positioning is carried out on the server, so that the consumption of computing resources and the consumption of storage resources of the terminal equipment are reduced. By adopting a cloud-edge integrated system architecture, the computing pressure can be shared, the hardware cost of the terminal equipment is reduced, and the network transmission data volume is reduced. The final positioning result can be displayed in a three-dimensional visual map, and managers can access the server through a Web end to carry out interactive display.
Based on the same application concept as the method, the embodiment of the application provides a cloud edge management system. The cloud edge management system comprises a terminal device and a server, and the server comprises a three-dimensional visual map of a target scene, wherein: the terminal device is used for acquiring a target image of the target scene and motion data of the terminal device in the process of moving in the target scene, and determining a self-positioning track of the terminal device based on the target image and the motion data; if the target image comprises multiple frames of images, selecting a partial image from the multiple frames of images as an image to be detected, and sending the image to be detected and the self-positioning track to the server; the server is used for generating a fusion positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, wherein the fusion positioning track comprises a plurality of fusion positioning poses; and for each fusion positioning pose in the fusion positioning track, determining a target positioning pose corresponding to the fusion positioning pose, and displaying the target positioning pose.
Illustratively, the terminal device includes a vision sensor and a motion sensor; the vision sensor is used for acquiring a target image of the target scene, and the motion sensor is used for acquiring motion data of the terminal equipment; wherein the terminal device is a wearable device and the visual sensor and the motion sensor are deployed on the wearable device; or the terminal equipment is a recorder, and the vision sensor and the motion sensor are arranged on the recorder; or, the terminal device is a camera, and the vision sensor and the motion sensor are disposed on the camera.
For example, when the server generates the fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, the server is specifically configured to:
determining a target map point corresponding to the image to be detected from the three-dimensional visual map, and determining a global positioning track of the terminal equipment in the three-dimensional visual map based on the target map point;
generating a fused positioning track of the terminal equipment in the three-dimensional visual map based on the self-positioning track and the global positioning track; the frame rate of the fusion positioning poses included by the fusion positioning track is greater than the frame rate of the global positioning poses included by the global positioning track; the frame rate of the fused localization poses included in the fused localization tracks is equal to the frame rate of the self-localization poses included in the self-localization tracks.
Illustratively, when determining the target positioning pose corresponding to the fusion positioning pose and displaying the target positioning pose, the server is specifically configured to: convert the fusion positioning pose into a target positioning pose in the three-dimensional visualization map based on a target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map, and display the target positioning pose through the three-dimensional visualization map;
the server comprises client software, and the client software reads and renders the three-dimensional visualization map and displays the target positioning pose on the three-dimensional visualization map;
a user accesses the client software through a Web browser so as to view the target positioning pose displayed in the three-dimensional visualization map through the client software;
when the target positioning pose displayed in the three-dimensional visualization map is viewed through the client software, the viewing angle of the three-dimensional visualization map is changed by dragging the mouse.
Based on the same application concept as the method, the embodiment of the present application provides a pose display apparatus, which is applied to a server in a cloud edge management system, where the server includes a three-dimensional visual map of a target scene, as shown in fig. 7, and is a structure diagram of the pose display apparatus, and the pose display apparatus includes:
an obtaining module 71, configured to obtain an image to be detected and a self-positioning track; the self-positioning track is determined by terminal equipment based on a target image of the target scene and motion data of the terminal equipment, and the image to be detected is a partial image in a multi-frame image included in the target image; a generating module 72, configured to generate a fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, where the fused positioning track includes multiple fused positioning poses; and the display module 73 is configured to determine, for each fusion positioning pose in the fusion positioning trajectory, a target positioning pose corresponding to the fusion positioning pose, and display the target positioning pose.
For example, the generating module 72 is specifically configured to, when generating the fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track: determining a target map point corresponding to the image to be detected from a three-dimensional visual map, and determining a global positioning track of the terminal equipment in the three-dimensional visual map based on the target map point; generating a fused positioning track of the terminal equipment in the three-dimensional visual map based on the self-positioning track and the global positioning track; the frame rate of the fusion positioning poses included in the fusion positioning track is greater than the frame rate of the global positioning poses included in the global positioning track; the frame rate of the fused localization poses included in the fused localization tracks is equal to the frame rate of the self-localization poses included in the self-localization tracks.
Illustratively, the three-dimensional visual map includes at least one of: a pose matrix corresponding to the sample image, a sample global descriptor corresponding to the sample image, a sample local descriptor corresponding to the characteristic point in the sample image and map point information; the generating module 72 determines a target map point corresponding to the image to be measured from the three-dimensional visual map, and when determining the global positioning track of the terminal device in the three-dimensional visual map based on the target map point, is specifically configured to: selecting candidate sample images from the multi-frame sample images according to the similarity between each frame of image to be detected and the multi-frame sample images corresponding to the three-dimensional visual map; acquiring a plurality of feature points from an image to be detected; for each feature point, determining a target map point corresponding to the feature point from a plurality of map points corresponding to the candidate sample image; determining a global positioning pose in a three-dimensional visual map corresponding to the image to be detected based on the plurality of feature points and target map points corresponding to the plurality of feature points; and generating a global positioning track of the terminal equipment in the three-dimensional visual map based on the global positioning poses corresponding to all the images to be measured.
For example, the generating module 72 is specifically configured to, when generating the fused localization track of the terminal device in the three-dimensional visual map based on the self-localization track and the global localization track: selecting N self-positioning poses corresponding to a target time period from all self-positioning poses included in a self-positioning track, and selecting P global positioning poses corresponding to the target time period from all global positioning poses included in the global positioning track; n is greater than P; determining N fusion positioning poses corresponding to the N self-positioning poses based on the N self-positioning poses and the P global positioning poses, wherein the N self-positioning poses correspond to the N fusion positioning poses one by one; and generating a fusion positioning track of the terminal equipment in the three-dimensional visual map based on the N fusion positioning poses.
For example, when determining the target positioning pose corresponding to the fusion positioning pose and displaying the target positioning pose, the display module 73 is specifically configured to: convert the fusion positioning pose into a target positioning pose in the three-dimensional visualization map based on a target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map, and display the target positioning pose through the three-dimensional visualization map. The display module 73 is further configured to determine the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map by: for each of a plurality of calibration points, determining a coordinate pair corresponding to the calibration point, wherein the coordinate pair comprises a position coordinate of the calibration point in the three-dimensional visual map and a position coordinate of the calibration point in the three-dimensional visualization map, and determining the target transformation matrix based on the coordinate pairs corresponding to the plurality of calibration points; or acquiring an initial transformation matrix, mapping the position coordinates in the three-dimensional visual map into mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix, and determining whether the initial transformation matrix has converged based on the relationship between the mapping coordinates and the actual coordinates in the three-dimensional visualization map; if so, determining the initial transformation matrix as the target transformation matrix; if not, adjusting the initial transformation matrix, taking the adjusted transformation matrix as the initial transformation matrix, and returning to perform the operation of mapping the position coordinates in the three-dimensional visual map into mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix; or sampling the three-dimensional visualization map to obtain a first point cloud corresponding to the three-dimensional visualization map, sampling the three-dimensional visual map to obtain a second point cloud corresponding to the three-dimensional visual map, and registering the first point cloud and the second point cloud by using an ICP (Iterative Closest Point) algorithm to obtain the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map.
Based on the same application concept as the method, the embodiment of the present application provides a server, where the server may include: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is used for executing machine executable instructions to realize the pose display method disclosed by the above example of the application.
Based on the same application concept as the method, embodiments of the present application further provide a machine-readable storage medium, where a plurality of computer instructions are stored, and when the computer instructions are executed by a processor, the pose display method disclosed in the above example of the present application can be implemented.
The machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard disk drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A pose display method, applied to a cloud edge management system, wherein the cloud edge management system comprises a terminal device and a server, the server comprises a three-dimensional visual map of a target scene, and the method comprises the following steps:
during movement in the target scene, the terminal device acquires a target image of the target scene and motion data of the terminal device, and determines a self-positioning track of the terminal device based on the target image and the motion data; if the target image comprises a plurality of frames of images, the terminal device selects a partial image from the plurality of frames of images as an image to be detected, and sends the image to be detected and the self-positioning track to the server;
the server generates a fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, wherein the fused positioning track comprises a plurality of fused positioning poses;
and for each fused positioning pose in the fused positioning track, the server determines a target positioning pose corresponding to the fused positioning pose and displays the target positioning pose.
2. The method of claim 1, wherein the terminal device determining the self-positioning track of the terminal device based on the target image and the motion data comprises:
the terminal device traverses each frame of the plurality of frames of images as a current frame image; determines a self-positioning pose corresponding to the current frame image based on self-positioning poses corresponding to K frames of images preceding the current frame image, a map position of the terminal device in a self-positioning coordinate system, and the motion data; and generates the self-positioning track of the terminal device in the self-positioning coordinate system based on the self-positioning poses corresponding to the plurality of frames of images;
if the current frame image is a key image, generating a map position in the self-positioning coordinate system based on the current position of the terminal device; and if the number of matched feature points between the current frame image and a previous frame image of the current frame image does not reach a preset threshold, determining that the current frame image is a key image.
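The key-image rule in claim 2 hinges on counting matched feature points between consecutive frames. The following Python sketch illustrates one possible implementation using OpenCV ORB features and brute-force matching; the feature type, matcher, and threshold value are assumptions for illustration and are not specified by the claim.

```python
# Illustrative sketch only: key-image decision from claim 2.
# Assumptions: ORB features, Hamming brute-force matching, threshold value.
import cv2

MATCH_THRESHOLD = 150  # assumed preset threshold of matched feature points

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def is_key_image(prev_gray, curr_gray, threshold=MATCH_THRESHOLD):
    """Return True when the current frame should become a key image,
    i.e. when the number of matched feature points with the previous
    frame does not reach the preset threshold."""
    _, des_prev = orb.detectAndCompute(prev_gray, None)
    _, des_curr = orb.detectAndCompute(curr_gray, None)
    if des_prev is None or des_curr is None:
        return True  # too few features detected: treat as a key image
    matches = matcher.match(des_prev, des_curr)
    return len(matches) < threshold
```

When this check returns True, the claim has the terminal device add a new map position in the self-positioning coordinate system based on its current position.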
3. The method of claim 1,
wherein the server generating a fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track comprises:
determining a target map point corresponding to the image to be detected from the three-dimensional visual map, and determining a global positioning track of the terminal device in the three-dimensional visual map based on the target map point;
generating the fused positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track; wherein the frame rate of the fused positioning poses included in the fused positioning track is greater than the frame rate of the global positioning poses included in the global positioning track, and the frame rate of the fused positioning poses included in the fused positioning track is equal to the frame rate of the self-positioning poses included in the self-positioning track.
4. The method of claim 3, wherein the three-dimensional visual map comprises at least one of: a pose matrix corresponding to a sample image, a sample global descriptor corresponding to the sample image, sample local descriptors corresponding to feature points in the sample image, and map point information; and the server determining a target map point corresponding to the image to be detected from the three-dimensional visual map and determining a global positioning track of the terminal device in the three-dimensional visual map based on the target map point comprises:
for each frame of the image to be detected, the server selects a candidate sample image from a plurality of frames of sample images based on the similarity between the image to be detected and the plurality of frames of sample images corresponding to the three-dimensional visual map;
the server acquires a plurality of feature points from the image to be detected, and for each feature point, determines a target map point corresponding to the feature point from a plurality of map points corresponding to the candidate sample image;
determining a global positioning pose in the three-dimensional visual map corresponding to the image to be detected based on the plurality of feature points and the target map points corresponding to the plurality of feature points; and generating the global positioning track of the terminal device in the three-dimensional visual map based on the global positioning poses corresponding to all the images to be detected.
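Claim 4 describes a retrieve-then-match-then-solve pipeline. Below is a minimal Python sketch of that pipeline under several assumptions: global descriptors are compared by dot-product similarity, the map data structure and the 2D-3D association helper `match_to_map_points` are hypothetical placeholders, and the pose is recovered with PnP plus RANSAC via OpenCV; none of these specifics are fixed by the claim.

```python
# Illustrative sketch of claim 4's global localization for one query image.
# visual_map.sample_global_descs, visual_map.samples and
# candidate.match_to_map_points are hypothetical placeholders.
import cv2
import numpy as np

def global_positioning_pose(query_global_desc, query_keypoints, query_descs,
                            visual_map, camera_matrix):
    # 1. Candidate retrieval: pick the sample image whose global descriptor
    #    is most similar to the query image's global descriptor.
    sims = visual_map.sample_global_descs @ query_global_desc
    candidate = visual_map.samples[int(np.argmax(sims))]

    # 2. 2D-3D association: match the query's local descriptors against the
    #    map points observed in the candidate sample image.
    pts_2d, pts_3d = candidate.match_to_map_points(query_keypoints, query_descs)

    # 3. Pose solving: PnP with RANSAC yields the global positioning pose of
    #    the query image in the three-dimensional visual map.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        np.asarray(pts_3d, dtype=np.float32),
        np.asarray(pts_2d, dtype=np.float32),
        camera_matrix, None)
    return (rvec, tvec) if ok else None
```

Running this per frame of the image to be detected and stringing the results together gives the global positioning track referred to in the claim.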
5. The method of claim 3,
wherein the server generating the fused positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track comprises:
the server selects N self-positioning poses corresponding to a target time period from all self-positioning poses included in the self-positioning track, and selects P global positioning poses corresponding to the target time period from all global positioning poses included in the global positioning track, wherein N is greater than P;
determining N fused positioning poses corresponding to the N self-positioning poses based on the N self-positioning poses and the P global positioning poses, wherein the N self-positioning poses correspond to the N fused positioning poses in a one-to-one manner;
and generating the fused positioning track of the terminal device in the three-dimensional visual map based on the N fused positioning poses.
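Claims 3 and 5 leave the fusion algorithm itself open. One simple way to obtain "N fused poses from N self-positioning poses and P global poses (N > P)" is to estimate a rigid correction that best aligns the self-positioning positions of the P globally localized frames to their global positions, and then apply that correction to all N self-positioning poses. The sketch below does exactly that with a least-squares (Kabsch) fit; it is an assumption for illustration, not the claimed method itself.

```python
# A minimal fusion sketch, assuming rigid least-squares (Kabsch) alignment.
# self_positions: (N, 3) positions from the self-positioning track.
# global_positions: (P, 3) positions from the global positioning track.
# global_indices: indices of those P frames within the N self-positioning poses.
import numpy as np

def fuse_positions(self_positions, global_positions, global_indices):
    self_positions = np.asarray(self_positions, dtype=float)   # (N, 3)
    src = self_positions[np.asarray(global_indices)]           # (P, 3)
    dst = np.asarray(global_positions, dtype=float)            # (P, 3)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_src).T @ (dst - mu_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # enforce a proper rotation (det = +1)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_dst - R @ mu_src
    # Correcting every self-positioning position keeps the fused track at
    # the frame rate of the self-positioning track (N poses out, N > P).
    return self_positions @ R.T + t
```

This also makes the frame-rate relationship in claims 3 and 5 concrete: the fused track inherits the high frame rate of the self-positioning track while being anchored by the lower-rate global poses.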
6. The method of claim 1,
wherein the server determining a target positioning pose corresponding to the fused positioning pose and displaying the target positioning pose comprises: converting the fused positioning pose into a target positioning pose in a three-dimensional visualization map based on a target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map, and displaying the target positioning pose through the three-dimensional visualization map; wherein the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map is determined in one of the following manners:
for each of a plurality of calibration points, determining a coordinate pair corresponding to the calibration point, wherein the coordinate pair comprises a position coordinate of the calibration point in the three-dimensional visual map and a position coordinate of the calibration point in the three-dimensional visualization map; and determining the target transformation matrix based on the coordinate pairs corresponding to the plurality of calibration points;
or, acquiring an initial transformation matrix, mapping position coordinates in the three-dimensional visual map to mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix, and determining whether the initial transformation matrix has converged based on the relationship between the mapping coordinates and actual coordinates in the three-dimensional visualization map; if so, determining the initial transformation matrix as the target transformation matrix; if not, adjusting the initial transformation matrix, taking the adjusted transformation matrix as the initial transformation matrix, and returning to the operation of mapping the position coordinates in the three-dimensional visual map to mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix;
or, sampling the three-dimensional visual map to obtain a first point cloud corresponding to the three-dimensional visual map, sampling the three-dimensional visualization map to obtain a second point cloud corresponding to the three-dimensional visualization map, and registering the first point cloud and the second point cloud by using an ICP (Iterative Closest Point) algorithm to obtain the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map.
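For the third option in claim 6 (point-cloud registration), the sketch below uses the Open3D implementation of the Iterative Closest Point algorithm to estimate the 4x4 target transformation matrix from points sampled out of the two maps. Open3D itself, the point-to-point variant, and the correspondence distance are assumptions made for illustration; the claim only requires an ICP registration.

```python
# Illustrative ICP registration between points sampled from the
# three-dimensional visual map (source) and the three-dimensional
# visualization map (target). Requires open3d; parameters are assumptions.
import numpy as np
import open3d as o3d

def target_transformation_matrix(first_point_cloud, second_point_cloud,
                                 max_correspondence_distance=0.5):
    """first/second_point_cloud: (M, 3) and (K, 3) arrays of sampled points.
    Returns a 4x4 matrix mapping the first point cloud onto the second."""
    src = o3d.geometry.PointCloud(
        o3d.utility.Vector3dVector(np.asarray(first_point_cloud, dtype=float)))
    dst = o3d.geometry.PointCloud(
        o3d.utility.Vector3dVector(np.asarray(second_point_cloud, dtype=float)))
    result = o3d.pipelines.registration.registration_icp(
        src, dst, max_correspondence_distance, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # the target transformation matrix
```

The first option in the claim (paired calibration points) can be solved with the same least-squares alignment shown after claim 5, applied to the coordinate pairs instead of trajectory positions.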
7. A cloud edge management system, characterized by comprising a terminal device and a server, wherein the server comprises a three-dimensional visual map of a target scene, and wherein:
the terminal device is configured to acquire, during movement in the target scene, a target image of the target scene and motion data of the terminal device, and determine a self-positioning track of the terminal device based on the target image and the motion data; and if the target image comprises a plurality of frames of images, select a partial image from the plurality of frames of images as an image to be detected, and send the image to be detected and the self-positioning track to the server;
the server is configured to generate a fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, wherein the fused positioning track comprises a plurality of fused positioning poses; and for each fused positioning pose in the fused positioning track, determine a target positioning pose corresponding to the fused positioning pose and display the target positioning pose.
8. The system of claim 7, wherein the terminal device comprises a vision sensor and a motion sensor; the vision sensor is configured to acquire the target image of the target scene, and the motion sensor is configured to acquire the motion data of the terminal device;
wherein the terminal device is a wearable device, and the vision sensor and the motion sensor are disposed on the wearable device; or the terminal device is a recorder, and the vision sensor and the motion sensor are disposed on the recorder; or the terminal device is a camera, and the vision sensor and the motion sensor are disposed on the camera.
9. The system of claim 7,
wherein when generating the fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, the server is specifically configured to:
determine a target map point corresponding to the image to be detected from the three-dimensional visual map, and determine a global positioning track of the terminal device in the three-dimensional visual map based on the target map point;
and generate the fused positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track; wherein the frame rate of the fused positioning poses included in the fused positioning track is greater than the frame rate of the global positioning poses included in the global positioning track, and the frame rate of the fused positioning poses included in the fused positioning track is equal to the frame rate of the self-positioning poses included in the self-positioning track.
10. The system of claim 7,
wherein when determining a target positioning pose corresponding to the fused positioning pose and displaying the target positioning pose, the server is specifically configured to: convert the fused positioning pose into a target positioning pose in a three-dimensional visualization map based on a target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map, and display the target positioning pose through the three-dimensional visualization map;
wherein the server comprises client software, and the client software reads and renders the three-dimensional visualization map and displays the target positioning pose in the three-dimensional visualization map;
a user accesses the client software through a Web browser to view the target positioning pose displayed in the three-dimensional visualization map through the client software;
and when the target positioning pose displayed in the three-dimensional visualization map is viewed through the client software, the viewing angle of the three-dimensional visualization map can be changed by dragging the mouse.
11. A pose display device applied to a server in a cloud edge management system, wherein the server comprises a three-dimensional visual map of a target scene, and the device comprises:
an acquisition module, configured to acquire an image to be detected and a self-positioning track, wherein the self-positioning track is determined by a terminal device based on a target image of the target scene and motion data of the terminal device, and the image to be detected is a partial image among a plurality of frames of images included in the target image;
a generating module, configured to generate a fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, wherein the fused positioning track comprises a plurality of fused positioning poses;
and a display module, configured to, for each fused positioning pose in the fused positioning track, determine a target positioning pose corresponding to the fused positioning pose and display the target positioning pose.
12. The apparatus according to claim 11, wherein when generating the fused positioning track of the terminal device in the three-dimensional visual map based on the image to be detected and the self-positioning track, the generating module is specifically configured to: determine a target map point corresponding to the image to be detected from the three-dimensional visual map, and determine a global positioning track of the terminal device in the three-dimensional visual map based on the target map point; and generate the fused positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track; wherein the frame rate of the fused positioning poses included in the fused positioning track is greater than the frame rate of the global positioning poses included in the global positioning track, and the frame rate of the fused positioning poses included in the fused positioning track is equal to the frame rate of the self-positioning poses included in the self-positioning track;
wherein the three-dimensional visual map comprises at least one of: a pose matrix corresponding to a sample image, a sample global descriptor corresponding to the sample image, sample local descriptors corresponding to feature points in the sample image, and map point information; and when determining the target map point corresponding to the image to be detected from the three-dimensional visual map and determining the global positioning track of the terminal device in the three-dimensional visual map based on the target map point, the generating module is specifically configured to: select a candidate sample image from a plurality of frames of sample images based on the similarity between the image to be detected and the plurality of frames of sample images corresponding to the three-dimensional visual map; acquire a plurality of feature points from the image to be detected; for each feature point, determine a target map point corresponding to the feature point from a plurality of map points corresponding to the candidate sample image; determine a global positioning pose in the three-dimensional visual map corresponding to the image to be detected based on the plurality of feature points and the target map points corresponding to the plurality of feature points; and generate the global positioning track of the terminal device in the three-dimensional visual map based on the global positioning poses corresponding to all the images to be detected;
wherein when generating the fused positioning track of the terminal device in the three-dimensional visual map based on the self-positioning track and the global positioning track, the generating module is specifically configured to: select N self-positioning poses corresponding to a target time period from all self-positioning poses included in the self-positioning track, and select P global positioning poses corresponding to the target time period from all global positioning poses included in the global positioning track, wherein N is greater than P; determine N fused positioning poses corresponding to the N self-positioning poses based on the N self-positioning poses and the P global positioning poses, wherein the N self-positioning poses correspond to the N fused positioning poses in a one-to-one manner; and generate the fused positioning track of the terminal device in the three-dimensional visual map based on the N fused positioning poses;
wherein when determining the target positioning pose corresponding to the fused positioning pose and displaying the target positioning pose, the display module is specifically configured to: convert the fused positioning pose into a target positioning pose in a three-dimensional visualization map based on a target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map, and display the target positioning pose through the three-dimensional visualization map; and the display module is further configured to determine the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map in one of the following manners: for each of a plurality of calibration points, determining a coordinate pair corresponding to the calibration point, wherein the coordinate pair comprises a position coordinate of the calibration point in the three-dimensional visual map and a position coordinate of the calibration point in the three-dimensional visualization map, and determining the target transformation matrix based on the coordinate pairs corresponding to the plurality of calibration points; or, acquiring an initial transformation matrix, mapping position coordinates in the three-dimensional visual map to mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix, and determining whether the initial transformation matrix has converged based on the relationship between the mapping coordinates and actual coordinates in the three-dimensional visualization map; if so, determining the initial transformation matrix as the target transformation matrix; if not, adjusting the initial transformation matrix, taking the adjusted transformation matrix as the initial transformation matrix, and returning to the operation of mapping the position coordinates in the three-dimensional visual map to mapping coordinates in the three-dimensional visualization map based on the initial transformation matrix; or, sampling the three-dimensional visual map to obtain a first point cloud corresponding to the three-dimensional visual map, sampling the three-dimensional visualization map to obtain a second point cloud corresponding to the three-dimensional visualization map, and registering the first point cloud and the second point cloud by using an ICP (Iterative Closest Point) algorithm to obtain the target transformation matrix between the three-dimensional visual map and the three-dimensional visualization map.
CN202111350621.9A 2021-11-15 2021-11-15 Pose display method, device and system Pending CN114185073A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111350621.9A CN114185073A (en) 2021-11-15 2021-11-15 Pose display method, device and system
PCT/CN2022/131134 WO2023083256A1 (en) 2021-11-15 2022-11-10 Pose display method and apparatus, and system, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111350621.9A CN114185073A (en) 2021-11-15 2021-11-15 Pose display method, device and system

Publications (1)

Publication Number Publication Date
CN114185073A (en) 2022-03-15

Family

ID=80540921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111350621.9A Pending CN114185073A (en) 2021-11-15 2021-11-15 Pose display method, device and system

Country Status (2)

Country Link
CN (1) CN114185073A (en)
WO (1) WO2023083256A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8174568B2 (en) * 2006-12-01 2012-05-08 Sri International Unified framework for precise vision-aided navigation
US20140323148A1 (en) * 2013-04-30 2014-10-30 Qualcomm Incorporated Wide area localization from slam maps
CN107818592B (en) * 2017-11-24 2022-04-01 北京华捷艾米科技有限公司 Method, system and interactive system for collaborative synchronous positioning and map construction
CN113382365B (en) * 2021-05-21 2022-06-10 北京索为云网科技有限公司 Pose tracking method and device of mobile terminal
CN114120301A (en) * 2021-11-15 2022-03-01 杭州海康威视数字技术股份有限公司 Pose determination method, device and equipment
CN114185073A (en) * 2021-11-15 2022-03-15 杭州海康威视数字技术股份有限公司 Pose display method, device and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023083256A1 (en) * 2021-11-15 2023-05-19 杭州海康威视数字技术股份有限公司 Pose display method and apparatus, and system, server and storage medium
WO2024001849A1 (en) * 2022-06-28 2024-01-04 中兴通讯股份有限公司 Visual-localization-based pose determination method and apparatus, and electronic device

Also Published As

Publication number Publication date
WO2023083256A1 (en) 2023-05-19

Similar Documents

Publication Publication Date Title
US11860923B2 (en) Providing a thumbnail image that follows a main image
US10134196B2 (en) Mobile augmented reality system
US11003956B2 (en) System and method for training a neural network for visual localization based upon learning objects-of-interest dense match regression
EP3134829B1 (en) Selecting time-distributed panoramic images for display
US9699375B2 (en) Method and apparatus for determining camera location information and/or camera pose information according to a global coordinate system
US9342927B2 (en) Augmented reality system for position identification
CN110617821B (en) Positioning method, positioning device and storage medium
EP3274964B1 (en) Automatic connection of images using visual features
EP3164811B1 (en) Method for adding images for navigating through a set of images
CN105009120A (en) Client-server based dynamic search
CN112435338B (en) Method and device for acquiring position of interest point of electronic map and electronic equipment
US9756260B1 (en) Synthetic camera lenses
CN114120301A (en) Pose determination method, device and equipment
WO2023083256A1 (en) Pose display method and apparatus, and system, server and storage medium
CN112991441A (en) Camera positioning method and device, electronic equipment and storage medium
JP6154759B2 (en) Camera parameter estimation apparatus, camera parameter estimation method, and camera parameter estimation program
US10878278B1 (en) Geo-localization based on remotely sensed visual features
CN114187344A (en) Map construction method, device and equipment
Ayadi et al. A skyline-based approach for mobile augmented reality
Chang et al. Augmented reality services of photos and videos from filming sites using their shooting locations and attitudes
CN114707392A (en) Local SLAM construction method, global SLAM construction method and construction device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination